BAYESIAN VARIABLE SELECTION FOR HIGH DIMENSIONAL
GENERALIZED LINEAR MODELS: CONVERGENCE 
RATES OF THE FITTED DENSITIES 

By Wenxin Jiang 

Northwestern University 

Bayesian variable selection has gained much empirical success
recently in a variety of applications when the number $K$ of explanatory
variables $(x_1, \ldots, x_K)$ is possibly much larger than the sample
size $n$. For generalized linear models, if most of the $x_j$'s have very
small effects on the response $y$, we show that it is possible to use
Bayesian variable selection to reduce overfitting caused by the curse of
dimensionality $K \gg n$. In this approach a suitable prior can be used to
choose a few out of the many $x_j$'s to model $y$, so that the posterior
will propose probability densities $p$ that are "often close" to the
true density $p^*$ in some sense. The closeness can be described by a
Hellinger distance between $p$ and $p^*$ that scales at a power very close
to $n^{-1/2}$, which is the "finite-dimensional rate" corresponding to a
low-dimensional situation. These findings extend some recent work
of Jiang [Technical Report 05-02 (2005) Dept. Statistics, Northwestern
Univ.] on consistency of Bayesian variable selection for binary
classification.

1. Introduction. Bayesian variable selection (BVS) is a fruitful method
for studying regression models that relate a response $y$ to a vector of
candidate explanatory variables $x = (x_1, \ldots, x_K)^T$. For example,
when generalized linear models (GLM) are considered, the density of $y$ and
the mean function of $y$ conditional on $x$ both depend on a linear
combination $x^T\beta$ through the regression coefficients
$\beta = (\beta_1, \ldots, \beta_K)^T$. The BVS approach uses priors
that propose different models $\gamma$ and the corresponding sets of
regression coefficients $\beta_\gamma$, where $\gamma$ indicates the
components of $x$ that are included in regression. The posterior
distribution $\pi(\gamma, \beta_\gamma | D^n)$ for the model and the
model parameters $(\gamma, \beta_\gamma)$ can then be obtained based on an
observed data set $D^n = (x^{(i)}, y^{(i)})_{i=1}^n$, which is often
assumed to consist of i.i.d. (independent and identically distributed)
copies of $(x, y)$. Computational simplification is achievable in the
cases of linear regression and probit regression, where the unknown
regression coefficients $\beta_\gamma$ can often be analytically
integrated out in the posterior-based computations (e.g., Kohn, Smith and
Chan [17]; Lee et al. [19]).

The BVS approach has had many successful applications. For example, 
when applied in a linear regression framework, BVS is used in basis selection 
for nonparametric regression (e.g., Smith and Kohn [23], Kohn, Smith and
Chan [17]) and in construction of financial index tracking portfolios (e.g., 
George and McCulloch [7]). Other work applying BVS in the GLM frame- 
work includes, for example, Clyde and DeSimone-Sasinowska [3], Nott and 
Leonte [21] and Wang and George [24]. Recently, BVS has been applied to 
the area of bioinformatics. In order to construct Gaussian graphical models 
for gene expression pathways, Dobra et al. [5] obtain biologically meaningful 
results by applying Bayesian variable selection to model how each gene in 
the graph relates to tens of thousands of other genes. In order to classify 
binary responses based on microarray data, Lee et al. [19] and Sha et al. 
[22] (via probit regression) and Zhou, Liu and Wong [27] (via logistic
regression) use BVS to achieve excellent cross-validated classification errors.
These most recent applications are especially noteworthy since they are all 
in the situation of $K \gg n$, where the number of candidate variables $K$ can
be several thousand and the sample size n is often less than a hundred. 

Despite these empirical successes, there has not been a systematic study of 
the frequentist properties of BVS, such as posterior consistency and conver- 
gence rates. It is the aim of this paper to study these convergence properties 
for BVS, allowing K to be possibly much larger than n. The consistency 
that we will consider is neither the traditional sense (i) of consistency in 
estimating the true regression parameters, nor the sense (ii) of consistency 
in identifying the true model (the $x$-components with nonzero regression
coefficients). Sense (i) is not feasible since in cases with $K \gg n$, the
$\beta$-coefficients are often not identifiable. Sense (ii) is not a totally
satisfactory framework when, as in many realistic situations, none of the
$K$ regression coefficients is exactly zero, even though many of them may
be very small. The consistency we consider is the closeness between the
true (conditional) density $p^* = p^*(y|x)$ and the densities
$p = p(y|x; \gamma, \beta_\gamma)$ proposed by the posterior
$\pi(\gamma, \beta_\gamma | D^n)$. We do not attempt to identify the "true parameter" or
the "true model" (the nonzero coefficients). Rather, we allow all coefficients 
to be not exactly zero, and attempt to construct the posterior to propose 
models that include only a few of those nonzero coefficients, but have the 
corresponding densities p "often close" to the true p* in some sense. 

Let $\nu_x(dx)$ be the probability measure for $x$ and $\nu_y(dy)$ be the
dominating measure for conditional densities $p$ and $p^*$. Define the
Hellinger distance between $p$ and $p^*$ as
$d(p,p^*) = \sqrt{\int\int \nu_y(dy)\nu_x(dx)(\sqrt{p} - \sqrt{p^*})^2}$.
The convergence results we consider describe the "often closeness" between
$p$ and $p^*$ that can be formulated as, for example,

(1)    $P^*[\pi[d(p,p^*) \le \varepsilon_n | D^n] > 1 - \delta_n] > 1 - \lambda_n$

for all large enough $n$, for some small $\varepsilon_n, \delta_n, \lambda_n$
converging to zero as $n \to \infty$, where $P^*$ is the probability measure
for data $D^n$, when they are generated as i.i.d. copies from
$p^* \nu_y(dy)\nu_x(dx)$.
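As a concrete check of this definition, the following minimal sketch compares a numerical evaluation of $\int(\sqrt{p} - \sqrt{p^*})^2\,dy$ with the closed form $2(1 - e^{-(\mu_1-\mu_2)^2/8})$ for two unit-variance normal densities. It holds $x$ fixed, so the outer integral over $\nu_x(dx)$ is omitted; the function names, the integration range and the grid size are illustrative assumptions.

```python
import math

def normal_pdf(y, mean, sd=1.0):
    """Density of N(mean, sd^2) at y."""
    z = (y - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def hellinger_sq_numeric(pdf_p, pdf_q, lo=-30.0, hi=30.0, steps=60000):
    """Midpoint-rule approximation of the integral of (sqrt(p) - sqrt(q))^2 dy."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        y = lo + (k + 0.5) * width
        diff = math.sqrt(pdf_p(y)) - math.sqrt(pdf_q(y))
        total += diff * diff
    return total * width

def hellinger_sq_unit_normals(mean_p, mean_q):
    """Closed form of the same integral for N(mean_p, 1) versus N(mean_q, 1)."""
    return 2.0 * (1.0 - math.exp(-((mean_p - mean_q) ** 2) / 8.0))

numeric = hellinger_sq_numeric(lambda y: normal_pdf(y, 0.0),
                               lambda y: normal_pdf(y, 1.5))
closed = hellinger_sq_unit_normals(0.0, 1.5)
```

In this convention the squared distance lies between 0 and 2, which is why a bound of the form $d(p,p^*)^2 \le 2$ can be used freely.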

It is noted that BVS is essential for achieving convergence results as above, 
when $K \gg n$. A usual approach without variable selection, using the full
model and putting a prior on all the regression coefficients, can be shown to 
lead to bad results in the following counterexample. 

Example. Suppose $K > n$. Let the random variable $z$ take values from
$\{j/K\}_{j=1}^K$ with equal probability and let
$x = (x_1, \ldots, x_K)^T$, where $x_j = I[z = j/K]$ for each $j$. Let
$(z^{(i)}, x^{(i)}, y^{(i)})_{i=1}^n$ and $(z, x, y)$ be i.i.d., where
$y \sim N(0, 1)$. Suppose the fitted model is
$y|x \sim N(\sum_{j=1}^K \beta_j x_j, 1)$, where, without selecting among
the $x_j$'s, one proposes a prior for the $\beta_j$'s as i.i.d. $N(0, 1)$.

The Hellinger distance $d(p,p^*)$ between $p^* = e^{-y^2/2}/\sqrt{2\pi}$ and
$p = e^{-(y - \sum_{j=1}^K \beta_j x_j)^2/2}/\sqrt{2\pi}$ is such that
$d(p,p^*)^2 = (2/K)\sum_{j=1}^K (1 - e^{-\beta_j^2/8})$.

Then, in the posterior conditional on $D^n = (x^{(i)}, y^{(i)})_{i=1}^n$,
$\beta_1, \ldots, \beta_K$ are independent and
$\beta_j \sim N(\sum_{i=1}^n y^{(i)} x_j^{(i)} / (1 + \sum_{i=1}^n (x_j^{(i)})^2),
1/(1 + \sum_{i=1}^n (x_j^{(i)})^2))$. Note that $(x_j^{(i)})_{i=1}^n$ are
zero for at least $K - n$ of the $K$ $j$'s since $x_j^{(i)} = I[z^{(i)} = j/K]$,
and the $n$ $z^{(i)}$'s can only populate at most $n$ out of $K$ of the $j/K$
locations. Therefore, at least $K - n$ out of the $K$ $\beta_j$'s follow the
$N(0, 1)$ distribution in the posterior, which is the same as the
corresponding prior distribution. Without loss of generality, let
$\beta_1, \ldots, \beta_{K-n}$ be independent $N(0, 1)$ in the posterior.
Note that
$d(p,p^*)^2 \ge [2(K - n)/K][1/(K - n)]\sum_{j=1}^{K-n}(1 - e^{-\beta_j^2/8})$.
For a simple example, let $K = 2n$. An application of Chebyshev's
inequality leads to
$\pi[d(p,p^*) > \eta^{1/2} | D^n] \ge 1 - (\eta^2 n)^{-1}$, for
$\eta = 1/2 - 1/\sqrt{5}$, which happens with $P^*$-probability 1.
Therefore, without variable selection, a convergence result such as (1)
cannot hold. Such a convergence result, however, can be shown to hold for
this example, with $\varepsilon_n$ following a near finite-dimensional rate
(a power close to $1/\sqrt{n}$), if Bayesian variable selection is
used properly, according to later results of this paper (e.g., Remark 1).
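The counterexample can be explored by simulation. The sketch below is illustrative only: it takes $K = 2n$, generates one data set, draws $\beta$ once from the full-model posterior (using the conjugate normal form, with $x$ a unit vector), and evaluates $d(p,p^*)^2 = (2/K)\sum_j(1 - e^{-\beta_j^2/8})$. The sample size, seed and number of repetitions are arbitrary choices, and each repetition uses a fresh data set rather than many draws from one posterior.

```python
import math
import random

ETA = 0.5 - 1.0 / math.sqrt(5.0)  # the constant eta from the example

def posterior_draw_hellinger_sq(n, rng):
    """Simulate data with K = 2n, draw beta once from the full-model posterior,
    and return d(p, p*)^2 = (2/K) * sum_j (1 - exp(-beta_j^2 / 8))."""
    K = 2 * n
    counts = [0] * K   # number of observations with x_j = 1
    sums = [0.0] * K   # sum of y over those observations
    for _ in range(n):
        j = rng.randrange(K)        # z is uniform on K locations, x is a unit vector
        y = rng.gauss(0.0, 1.0)     # true model: y ~ N(0, 1), independent of x
        counts[j] += 1
        sums[j] += y
    total = 0.0
    for j in range(K):
        post_var = 1.0 / (1.0 + counts[j])   # prior N(0, 1) plus counts[j] observations
        post_mean = sums[j] * post_var
        beta_j = rng.gauss(post_mean, math.sqrt(post_var))
        total += 1.0 - math.exp(-beta_j * beta_j / 8.0)
    return 2.0 * total / K

rng = random.Random(0)
draws = [posterior_draw_hellinger_sq(200, rng) for _ in range(50)]
fraction_far = sum(d2 > ETA for d2 in draws) / len(draws)
```

Because more than half of the coordinates never appear in the data, their posterior equals the prior, and the squared distance stays bounded away from zero however large $n$ is made.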

There has been considerable interest recently in studying the theoretical 
properties of high-dimensional regression. Most results are for frequentist 
methods. For example, Bühlmann [2] considers boosting for high-dimensional
regression; Greenshtein and Ritov [12] and Greenshtein [11] consider
constrained or $\ell_1$-penalized optimization; Meinshausen and Bühlmann [20]
apply a similar method of $\ell_1$ penalization to high-dimensional graphical
models. Recently, Fan and Li [6] have provided a useful overview for methods
based on the penalized likelihood for treating high dimensionality, which 
includes examples of generalized linear models and survival models, among 
others. In contrast to these frequentist approaches based on optimization, 
the Bayesian method considered here has the attractive capability of pre- 
senting several likely models together with the corresponding posterior prob- 
abilities. A theoretical study of Bayesian inference without variable selection 
has been carried out by, for example, Ghosal [8, 9]. This work considers $K$'s
growing with $n$ but at a slower rate. On the other hand, in the $K \ll n$ case
treated in Ghosal [8, 9], posterior asymptotic normality was established for 
the whole parameter vector, so the goal there was much higher, and hence, 
the result there is not comparable with the result in the present paper which 
focuses on posterior convergence rates. 

In contrast to previous work, we consider Bayesian variable selection and 
allow the cases $K \gg n$. It is noted that it is essential to have the variable
selection step in order to obtain good results when $K \gg n$. The
counterexample above shows that, without variable selection, it is impossible to have
good convergence in general cases with K > n, while with variable selection, 
excellent empirical performance has been reported, for example, in Lee et 
al. [19] and Sha et al. [22] with K > n. 

We study the convergence behavior of BVS for generalized linear models, 
which include linear regression, logistic regression, probit regression, Pois- 
son regression, and so on. We also include a discussion of Gaussian graphical 
models that uses linear regression for neighborhood selection. Therefore, the 
current paper forms an extension to Jiang [15], who only considers consis- 
tency of BVS for binary logistic and probit regression, without studying the 
convergence rates. Here we study the convergence rate $\varepsilon_n$ as well,
and will show that despite the high dimension $K \gg n$, BVS can still lead
to a near finite-dimensional rate (with $\varepsilon_n$ close to $1/\sqrt{n}$
in order), if we are in some
"sparse" situations when most of the regression coefficients are very small.
(For binary regression, this rate $\varepsilon_n$ also forms a good convergence rate for
the purpose of classification, as shown in Section 5 later.) For such sparse 
high-dimensional problems, Bayesian variable selection can therefore help 
to reduce "overfitting" or the "curse of dimensionality." Note that such a 
conclusion can only be drawn by a careful study of the convergence rates 
in high dimensions; just proving the consistency, as in Jiang [15], is not 
enough; for example, it is well known (e.g., Hastie, Tibshirani and Friedman 
[13], Chapter 13) that the $k$-nearest neighbor rules are consistent for
classification, but can suffer considerably from the curse of dimensionality.
Also, it is well known (e.g., Devroye, Györfi and Lugosi [4], Chapters 6 and 7) that
(at least in finite dimensions) there exist universally consistent classification
rules, but any rule can have a very slow convergence rate under some data 
distribution. 

Below we will first specify the notation and the framework of the paper. 

2. Notation and framework. The explanatory variable is a
$K_n$-dimensional random vector $x = (x_1, x_2, \ldots, x_{K_n})^T$.
Following the typical practice of studying high-dimensional problems, we
will formally consider the asymptotics when $K_n$ increases as $n \to \infty$.

For simplicity, we will assume that $|x_j| \le 1$ for all $j$ for most of the
later discussion. The results can be easily extended to the case when all
$|x_j|$'s are bounded above by a large constant.

The response is $y$. The true relation between $y$ and $x$ is assumed to follow
a parametric generalized linear model (GLM) with true conditional density
$p^*(y|x)$ and the corresponding mean function $\mu^*(x)$. Generalized linear
models (GLM) are a class of popular regression models relating a response
$y$ to a vector of covariates $x$. The GLM with one natural parameter is
constructed with a density of the form
$p^*(y|x) = \exp\{a(h^*)y + b(h^*) + c(y)\} = f(y, h^*)$, where
$h^* = x^T\beta^*$ is the linear parameter, $a(h)$ and $b(h)$ are
continuously differentiable, and $a(h)$ has nonzero derivative. The mean
function $\mu^* = E(y|x) = -b'(h^*)/a'(h^*) \equiv \psi(x^T\beta^*)$ follows
a transformed linear model, where the transform is the inverse of a chosen
link function. This formalism includes regression models for responses that
are binary, Poisson and Gaussian (with known error variance), and can be
easily extended to the cases with a dispersion parameter, which can then
include Gaussian models with unknown error variance.
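The identity $\psi(h) = -b'(h)/a'(h)$ can be checked numerically for familiar families. The sketch below is a minimal illustration that differentiates $a$ and $b$ by central differences; the helper names and the step size are assumptions made for this example, and the two families used are the Poisson and logistic cases written in the $\exp\{a(h)y + b(h) + c(y)\}$ form.

```python
import math

def derivative(func, h, step=1e-5):
    """Central-difference approximation of func'(h)."""
    return (func(h + step) - func(h - step)) / (2.0 * step)

def glm_mean(a, b, h):
    """Mean function psi(h) = -b'(h) / a'(h) for density exp{a(h) y + b(h) + c(y)}."""
    return -derivative(b, h) / derivative(a, h)

# Poisson with log link: a(h) = h, b(h) = -exp(h), so psi(h) should be exp(h).
poisson_mean = glm_mean(lambda h: h, lambda h: -math.exp(h), 0.7)

# Binary logistic: a(h) = h, b(h) = -ln(1 + e^h), so psi(h) should be e^h / (1 + e^h).
logistic_mean = glm_mean(lambda h: h, lambda h: -math.log1p(math.exp(h)), -0.3)
```

Both values agree with the usual inverse-link formulas up to the finite-difference error.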

We assume that corresponding to the true model $p^*$, there exists a true
regression parameter vector $\beta^*$, which satisfies some "sparseness"
conditions, describing situations when most components of $\beta^*$ are very
small in magnitude. One such condition states that
$\lim_{n\to\infty}\sum_{j=1}^{K_n}|\beta_j^*| < \infty$. Other
conditions can be formulated to describe how fast the sum of
$|\beta_j^*|$'s converges.

The condition limiting the sum of $|\beta_j^*|$'s has been considered by
Bühlmann [2] for studying how boosting algorithms handle high-dimensional
linear regression. As Bühlmann points out, as a special case, this condition
is satisfied when only a finite and fixed number of $x_j$'s are relevant,
that is, when the number of nonzero $\beta_j^*$'s is independent of $n$.
More generally, the sparseness conditions can describe situations when all
$x_j$'s are relevant, but most of them have very small effects
($|\beta_j^*|$'s).

Note that $p^*$ is the conditional density of $y|x$, which is also the joint
density of $(x,y)$ if the dominating measure $\nu_x(dx)\nu_y(dy)$ is the
product of the probability measure of $x$ and the dominating measure of $y$.
We will always use this kind of dominating measure.

The data for $n$ subjects are assumed to be independent and identically
distributed (i.i.d.) based on $p^*\nu_x(dx)\nu_y(dy)$. Therefore, showing the
subject index $i$, the data set is of the form
$D^n = (x_1^{(i)}, \ldots, x_{K_n}^{(i)}, y^{(i)})_{i=1}^n$.

The prior selects a subset of the $K_n$ $x$-variables in the data set to model
$y$, using density $f(y, x_\gamma^T\beta_\gamma)$, where
$\gamma = (\gamma_1, \ldots, \gamma_{K_n})$ has 0/1 valued components
which are 1 only when the corresponding $x$ component is included in the
model, that is, $\gamma_j = I[|\beta_j| > 0]$. Sometimes we will also use
$\gamma$ to denote the corresponding set of index $j$'s for which
$|\beta_j| > 0$. The notation $v_\gamma$ denotes the subvector of a vector
$v$ with components $\{v_j\}$, for all $j$'s with $\gamma_j = 1$ (or
for all $j \in \gamma$, if $\gamma$ is understood as the corresponding index set).

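We use the probability measure $\pi_n(\gamma, d\beta_\gamma)$ to denote the
prior distribution of the subset model $\gamma$ and the corresponding
regression coefficients $\beta_\gamma$. (The prior depends on the sample
size $n$, but we will often drop the subscript $n$ for simpler notation.)
This induces a posterior measure conditional on the data set $D^n$,

$\pi(\gamma, d\beta_\gamma | D^n) = \frac{\prod_{i=1}^n p(y^{(i)}, x^{(i)} | \gamma, \beta_\gamma)\,\pi_n(\gamma, d\beta_\gamma)}{\sum_{\gamma'}\int_{\beta_{\gamma'}}\prod_{i=1}^n p(y^{(i)}, x^{(i)} | \gamma', \beta_{\gamma'})\,\pi_n(\gamma', d\beta_{\gamma'})}$,

where $p(y, x | \gamma, \beta_\gamma) = f(y, x_\gamma^T\beta_\gamma)$. The
prior and posterior distributions for $(\gamma, \beta_\gamma)$ induce
distributions for the corresponding parameterized densities.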

For notational simplification, we will use $|v|$ to denote the sum of the
absolute values of the components for any vector $v$. For two positive
sequences $a_n$ and $b_n$, $a_n \prec b_n$ (or $b_n \succ a_n$) means
$\lim_{n\to\infty} a_n/b_n = 0$.

3. A prior specification. General conditions on the prior will be given
later in Section 7. Here, to be specific, we first consider the following
prior for $(\gamma, \beta_\gamma)$. Conditional on $\gamma$, $\beta_\gamma$
follows $N(0, V_\gamma)$, where $V_\gamma$ is a $|\gamma| \times |\gamma|$
covariance matrix.

To complete this prior specification, we let the model indicators
$\gamma = (\gamma_1, \ldots, \gamma_{K_n})$ be generated by first proposing
i.i.d. binary random variables $\tilde\gamma_j$, with
$\pi(\tilde\gamma_j = 1) = \lambda_n = r_n/K_n$, where we assume, for
convenience, that $r_n$ is some integer smaller than $K_n$. We then keep only
the $\tilde\gamma$'s satisfying a size restriction
$\sum_j |\tilde\gamma_j| \le \bar r_n$, and let the prior proposed model
$\gamma = \tilde\gamma$. Here $r_n$ is the prior expectation of model size
$|\tilde\gamma|$ before applying the size restriction; $\bar r_n$ is the
maximal possible model size. We assume $1 \le r_n \le \bar r_n \le K_n$.

Therefore,
$\pi(\gamma) \propto \prod_{j=1}^{K_n}\lambda_n^{\gamma_j}(1 - \lambda_n)^{1-\gamma_j} I[\sum_{j=1}^{K_n}\gamma_j \le \bar r_n]$.
Although the size restriction is not necessary (see more general conditions
in Section 7), it helps to keep the model from becoming too complicated and
gives a convenient starting point for proving the theoretical properties.
Also, without this kind of restriction, the design matrix
$\sum_{i=1}^n x_\gamma^{(i)} x_\gamma^{(i)T}$ would become singular when
the proposed model size $|\gamma| > n$; such a design matrix is often used in the
popular algorithms for generating the posterior distributions in Gaussian
regression (e.g., Smith and Kohn [23]), probit regression (e.g., Lee et al.
[19]) and logistic regression (e.g., Zhou et al. [27]).
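One way to draw from a prior of this form is rejection sampling: propose independent inclusion indicators with probability $r_n/K_n$ and discard proposals exceeding the size cap. The sketch below is illustrative only; it fixes the covariance to a multiple of the identity (one of the choices the text mentions next), and the function name, argument names and numerical settings are hypothetical.

```python
import math
import random

def sample_prior(K_n, r_n, r_bar_n, c=1.0, rng=random):
    """Draw (gamma, beta_gamma): gamma by rejection from i.i.d. Bernoulli(r_n / K_n)
    indicators subject to |gamma| <= r_bar_n; beta_gamma ~ N(0, c * I) (assumed V)."""
    lam = r_n / K_n
    while True:
        gamma = [j for j in range(K_n) if rng.random() < lam]
        if len(gamma) <= r_bar_n:   # size restriction on the proposed model
            break
    beta = {j: rng.gauss(0.0, math.sqrt(c)) for j in gamma}
    return gamma, beta

rng = random.Random(1)
gamma, beta = sample_prior(K_n=2000, r_n=5, r_bar_n=20, rng=rng)
```

When the cap is well above the prior expected size, as in this call, rejections are rare and the loop typically terminates on the first proposal.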

Under this specification of prior, we will present conditions on $V_\gamma$,
$r_n$ and $\bar r_n$ for proving results on posterior consistency and
convergence rates. The condition on $V_\gamma$ will depend on how the largest
eigenvalues ($\mathrm{ch}_1$) of $V_\gamma$ and $V_\gamma^{-1}$ grow with the
size of $|\gamma|$.

Let $\check B(\gamma) = \max\{\mathrm{ch}_1(V_\gamma), \mathrm{ch}_1(V_\gamma^{-1})\}$.
In many typical cases $\check B(\gamma)$ grows at most polynomially in model
size, that is, $\check B(\gamma) \le B|\gamma|^v$ for some constant $B > 0$
and some power $v > 0$, for all large $|\gamma|$. For example, when
$V_\gamma = cI_{|\gamma|}$ (proportional to the identity matrix; see, e.g.,
Dobra et al. [5]), $\check B(\gamma)$ is a constant and does not grow with
$|\gamma|$. For another example, $V_\gamma \sim$ constant $\times$
$(E x_\gamma x_\gamma^T)^{-1}$; see, for example, Smith and Kohn [23] and
Lee et al. [19], who use a sample approximation of this choice. Then the
largest eigenvalues of $V_\gamma$ and $V_\gamma^{-1}$ are both bounded
linearly for large $|\gamma|$, when $x_\gamma$ has components standardized
to have mean zero and common variance, and have all pairwise correlations
being $\rho \in (0,1)$. In addition, for $V_\gamma$ following the covariance
matrix of a finite-order AR or MA process, when the lag polynomials have
no zeros on the unit circle, the eigenvalues of $V_\gamma$ and
$V_\gamma^{-1}$ are also bounded such that
$\max\{\mathrm{ch}_1(V_\gamma), \mathrm{ch}_1(V_\gamma^{-1})\}$ grows like
$|\gamma|^0$. For a detailed discussion of these eigenvalues, see, for
example, Section 3 of Bickel and Levina [1].

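4. Convergence results for GLM. Here, for simplicity we will assume
that all explanatory variables are bounded and standardized such that
$|x_j| \le 1$ for all $j$. Assume
$\lim_{n\to\infty}\sum_{j=1}^{K_n}|\beta_j^*| < \infty$ for the regression
parameter $\beta^*$ corresponding to the true density $p^*$, where $K_n$ is
a nondecreasing sequence in $n$.

We also assume that the prior specification in Section 3 is used. Define
$\Delta(r_n) = \inf_{\gamma:|\gamma|=r_n}\sum_{j:j\notin\gamma}|\beta_j^*|$,
$\tilde B(r_n) = \sup_{\gamma:|\gamma|=r_n}\mathrm{ch}_1(V_\gamma^{-1})$ and
$\bar B(r_n) = \sup_{\gamma:|\gamma|=r_n}\mathrm{ch}_1(V_\gamma)$. Let
$B_n = \sup_{\gamma:|\gamma|\le\bar r_n}\mathrm{ch}_1(V_\gamma)$. Let
$D(R) = 1 + R \times \sup_{|h|\le R}|a'(h)| \cdot \sup_{|h|\le R}|\psi(h)|$
for any $R > 0$.

Theorem 1. Assume that the prior specification in Section 3 is used,
$|x_j| \le 1$ for all $j$ and
$\lim_{n\to\infty}\sum_{j=1}^{K_n}|\beta_j^*| < \infty$, where $K_n$ is a
nondecreasing sequence in $n$.

Let $\varepsilon_n$ be a sequence such that $\varepsilon_n \in (0, 1]$ for
each $n$ and $n\varepsilon_n^2 \succ 1$ and assume that the following
conditions also hold:

(2)    $\bar r_n \ln(1/\varepsilon_n^2) \prec n\varepsilon_n^2$,

(3)    $\bar r_n \ln(K_n) \prec n\varepsilon_n^2$,

(4)    $\bar r_n \ln D(\bar r_n\sqrt{n\varepsilon_n^2 B_n}) \prec n\varepsilon_n^2$,

(5)    $1 \le r_n \le \bar r_n \le K_n$,

(6)    $1 \prec r_n \prec K_n$,

(7)    $\Delta(r_n) \prec \varepsilon_n^2$,

(8)    $\tilde B(r_n) \prec n\varepsilon_n^2$,

(9)    $r_n \ln \bar B(r_n) \prec n\varepsilon_n^2$.

Denote $d(p,p^*)^2 = \int\int |p(y,x|\gamma,\beta_\gamma)^{1/2} - p^*(y,x)^{1/2}|^2 \nu_y(dy)\nu_x(dx)$.
Then we have the following successively stronger results:

(i) for some $c_0 > 0$,

$\lim_{n\to\infty} P^*\{\pi[d(p,p^*) \le \varepsilon_n | D^n] > 1 - e^{-c_0 n\varepsilon_n^2}\} = 1$;

(ii) for some $c_1 > 0$, and for all sufficiently large $n$,

$P^*\{\pi[d(p,p^*) > \varepsilon_n | D^n] > e^{-0.5c_1 n\varepsilon_n^2}\} \le e^{-0.5c_1 n\varepsilon_n^2}$;

(iii) for some $c_1 > 0$, and for all sufficiently large $n$,

$E^*_{D^n}\pi[d(p,p^*) > \varepsilon_n | D^n] \le e^{-c_1 n\varepsilon_n^2}$.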
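A short sketch of why the three statements are ordered by strength, using only Markov's inequality. The choice of any $c_0 \in (0, 0.5c_1)$ below is one admissible choice made for this illustration, not necessarily the constant intended in the theorem.

```latex
% (iii) => (ii): Markov's inequality applied, under P^*, to the random
% variable \pi[d(p,p^*) > \varepsilon_n \mid D^n].
P^*\bigl\{\pi[d(p,p^*) > \varepsilon_n \mid D^n] > e^{-0.5 c_1 n\varepsilon_n^2}\bigr\}
  \le \frac{E^*_{D^n}\,\pi[d(p,p^*) > \varepsilon_n \mid D^n]}{e^{-0.5 c_1 n\varepsilon_n^2}}
  \le \frac{e^{-c_1 n\varepsilon_n^2}}{e^{-0.5 c_1 n\varepsilon_n^2}}
  = e^{-0.5 c_1 n\varepsilon_n^2}.

% (ii) => (i): on the complementary event,
\pi[d(p,p^*) \le \varepsilon_n \mid D^n]
  = 1 - \pi[d(p,p^*) > \varepsilon_n \mid D^n]
  \ge 1 - e^{-0.5 c_1 n\varepsilon_n^2}
  > 1 - e^{-c_0 n\varepsilon_n^2}
  \quad\text{for any } c_0 \in (0, 0.5 c_1),
% and this event has P^*-probability at least
% 1 - e^{-0.5 c_1 n\varepsilon_n^2} \to 1, because n\varepsilon_n^2 \to \infty.
```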



The above condition on $D(\cdot) = D(\bar r_n\sqrt{n\varepsilon_n^2 B_n})$,
when considering a specific example of GLM, depends on how $|a'(h)|$ and
$|\psi(h)|$ grow with the linear parameter $h$. We will consider the
following examples here.

(a) Poisson regression with log linear link: mean $\mu = e^h$,
$y \in \{0, 1, 2, \ldots\}$. Then

$f(y, h) = \frac{e^{-\mu}\mu^y}{y!} = \exp\{hy - e^h - \ln(y!)\}$.

Here $a(h) = h$, $a' = 1$, $\psi(h) = e^h$. So both $|a'|$ and $|\psi|$ grow
at most exponentially in $|h|$.

(b) Normal linear regression: mean $\mu = h$; variance
$= \varphi^{-1} \in \Re^+$ is assumed to be known for now; $y \in \Re$. Then

$f(y, h) = \sqrt{\frac{\varphi}{2\pi}}\,e^{-\varphi(y-\mu)^2/2} = \exp\{\varphi hy - \frac{\varphi h^2}{2} - \frac{\varphi y^2}{2} - \frac{1}{2}\ln(2\pi\varphi^{-1})\}$.

Here $a(h) = \varphi h$, $a' = \varphi$, $\psi(h) = h$. So both $|a'|$ and
$|\psi|$ grow at most linearly in $|h|$.

(c) Exponential regression with log linear link: mean $\mu = e^h$,
$y \in (0, \infty)$. Then

$f(y, h) = \mu^{-1}e^{-y/\mu} = \exp\{-e^{-h}y - h\}$.

Here $a(h) = -e^{-h}$, $a' = e^{-h}$, $\psi = e^h$. So both $|a'|$ and
$|\psi|$ grow at most exponentially in $|h|$.

(d) Binary logistic regression: mean $\mu = e^h/(1 + e^h)$, $y \in \{0, 1\}$.
Then

$f(y, h) = \mu^y(1 - \mu)^{1-y} = \exp\{hy - \ln(1 + e^h)\}$.

Here $a(h) = h$, $a' = 1$, $\psi(h) = e^h/(1 + e^h)$. So both $|a'|$ and
$|\psi|$ are bounded above by 1.

(e) Binary probit regression: mean
$\mu = \Phi(h) = \int_{-\infty}^h (e^{-z^2/2}/\sqrt{2\pi})\,dz$,
$y \in \{0, 1\}$. Then

$f(y, h) = \mu^y(1 - \mu)^{1-y} = \exp\{y\ln(\Phi(h)/(1 - \Phi(h))) + \ln(1 - \Phi(h))\}$.

Here $a(h) = \ln(\Phi(h)/(1 - \Phi(h)))$,
$a' = [\Phi(h)^{-1} + \{1 - \Phi(h)\}^{-1}]\Phi'(h)$,
$\psi(h) = \Phi(h) \in [0, 1]$. By using Mills' ratio, it can be shown that
$|a'(h)|$ increases at most linearly with $|h|$.
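To see how these growth rates feed into $\ln D(R)$, the following sketch evaluates $D(R) = 1 + R\,\sup_{|h|\le R}|a'(h)|\,\sup_{|h|\le R}|\psi(h)|$ on a grid for three of the families. It is illustrative only: the suprema are approximated on a finite grid, the probit tail is computed with `math.erfc` for numerical stability, and the values of $R$ are arbitrary.

```python
import math

def D(R, abs_a_prime, abs_psi, grid=2000):
    """D(R) = 1 + R * sup|a'(h)| * sup|psi(h)|, suprema over a grid on [-R, R]."""
    hs = [-R + 2.0 * R * k / grid for k in range(grid + 1)]
    return 1.0 + R * max(abs_a_prime(h) for h in hs) * max(abs_psi(h) for h in hs)

def std_normal_pdf(h):
    return math.exp(-0.5 * h * h) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(h):
    return 0.5 * math.erfc(-h / math.sqrt(2.0))

def probit_a_prime(h):
    """a'(h) = [1/Phi(h) + 1/(1 - Phi(h))] * Phi'(h) for the probit model."""
    upper_tail = 0.5 * math.erfc(h / math.sqrt(2.0))  # 1 - Phi(h), stable for large h
    return std_normal_pdf(h) * (1.0 / std_normal_cdf(h) + 1.0 / upper_tail)

families = {
    "poisson": (lambda h: 1.0, lambda h: math.exp(h)),
    "logistic": (lambda h: 1.0, lambda h: 1.0 / (1.0 + math.exp(-h))),
    "probit": (probit_a_prime, std_normal_cdf),
}
log_D = {name: [math.log(D(R, a1, psi)) for R in (2.0, 4.0, 8.0)]
         for name, (a1, psi) in families.items()}
```

For the Poisson family $\ln D(R)$ grows roughly linearly in $R$, whereas for the logistic and probit families it grows only logarithmically; this is the distinction that separates the two groups of models in the discussion of condition (4).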

Using these rates of growth, we can make the condition on $D(\cdot)$ more
specific for specific examples of GLM.

The conditions of Theorem 1 also depend on how the eigenvalues of
$V_\gamma$ and $V_\gamma^{-1}$ behave. To be specific, assume that the
largest eigenvalues of $V_\gamma$ and $V_\gamma^{-1}$, for
$|\gamma| \le \bar r_n$, are both bounded above by some power
$\bar r_n^v$ ($v > 0$), for all large enough $\bar r_n$.

The condition on $r_n\ln\bar B(r_n)$ then becomes redundant since
$r_n\ln\bar B(r_n) \le \bar r_n\ln B_n \le c\bar r_n\ln\bar r_n \le c\bar r_n\ln K_n \prec n\varepsilon_n^2$
(for some constant $c > 0$ and for all large enough $n$), since $B_n$ is
bounded above by a power of $\bar r_n$.

The condition on $\bar r_n\ln(1/\varepsilon_n^2)$ also becomes redundant
[it is implied by the condition on $\bar r_n\ln(K_n)$ and
$n\varepsilon_n^2 \succ 1$] if we assume that $K_n \succ n^\delta$ for
some $\delta > 0$.

Consider now the condition on $\bar r_n\ln D(\cdot)$ for various regression
models, depending on the rate of growth
$D(\bar r_n\sqrt{n\varepsilon_n^2 B_n})$. This condition on
$\bar r_n\ln D(\cdot)$ becomes redundant [it is implied by the condition on
$\bar r_n\ln(K_n)$] for normal linear regression, binary logistic
regression, and probit regression since $D(\cdot)$ is bounded above by some
power of $\sqrt{\bar r_n^{2+v} n\varepsilon_n^2}$, which is bounded above by
some power of $K_n$ (note that $\bar r_n \le K_n$, $\varepsilon_n^2 \le 1$
and $K_n \succ n^\delta$ for some $\delta > 0$). The condition
$\tilde B(r_n) \prec n\varepsilon_n^2$ can be satisfied by requiring
$\bar r_n \prec (n\varepsilon_n^2)^{1/v}$. For Poisson and exponential
regressions with the log-linear link, however, $D(\cdot)$ grows
exponentially in $\sqrt{\bar r_n^{2+v} n\varepsilon_n^2}$. The condition on
$\bar r_n\ln D(\cdot)$ then cannot be ignored, and it can be satisfied if
$\bar r_n \prec (n\varepsilon_n^2)^{1/(4+v)}$ [which actually implies the
later condition $\tilde B(r_n) \prec n\varepsilon_n^2$ and makes it
redundant]. These are summarized as follows.

Theorem 2. Assume that the prior specification in Section 3 is used,
such that
$\max\{\sup_{\gamma:|\gamma|\le\bar r_n}\mathrm{ch}_1(V_\gamma), \sup_{\gamma:|\gamma|\le\bar r_n}\mathrm{ch}_1(V_\gamma^{-1})\} \le B\bar r_n^v$
for some positive constants $B$ and $v$, for all large enough $\bar r_n$.
Suppose $|x_j| \le 1$ for all $j$, and
$\lim_{n\to\infty}\sum_{j=1}^{K_n}|\beta_j^*| < \infty$, where $K_n$ is a
nondecreasing sequence in $n$ and $K_n \succ n^\delta$ for some $\delta > 0$.

Let $\varepsilon_n$ be a sequence such that $\varepsilon_n \in (0, 1]$ for
each $n$ and $n\varepsilon_n^2 \succ 1$ and assume that the following
conditions also hold:

(10)    $\bar r_n\ln(K_n) \prec n\varepsilon_n^2$,

(11)    $1 \le r_n \le \bar r_n \le K_n$,

(12)    $1 \prec r_n \prec K_n$,

(13)    $\Delta(r_n) \equiv \inf_{\gamma:|\gamma|=r_n}\sum_{j:j\notin\gamma}|\beta_j^*| \prec \varepsilon_n^2$.

Also assume that

(14)    $\bar r_n \prec (n\varepsilon_n^2)^{1/v}$

for normal linear, binary logistic and binary probit regression; or assume

(15)    $\bar r_n \prec (n\varepsilon_n^2)^{1/(4+v)}$

for Poisson or exponential regression with log-linear link function. Then the
results of Theorem 1 hold.

This result can be used to study the convergence rate $\varepsilon_n$ under
various situations, depending on how $K_n$ grows with $n$, as well as how
$\Delta(r_n) = \inf_{|\gamma|=r_n}\sum_{j\notin\gamma}|\beta_j^*|$ decays
with $r_n$. Here are some corollaries, which follow by assuming an
exponential decay rate of $\Delta(\cdot)$ and checking the conditions of
Theorem 2. This includes as a special case only a fixed and finite number of
$|\beta_j^*|$'s being nonzero, while also allowing a more realistic setup
with many small $|\beta_j^*|$'s, none of which is exactly zero.

Corollary 1. Consider the examples of Poisson regression, exponential
regression, normal linear regression, logistic regression or probit
regression described before. Assume that the prior specification in Section 3
is used, such that
$\max\{\sup_{\gamma:|\gamma|\le\bar r_n}\mathrm{ch}_1(V_\gamma), \sup_{\gamma:|\gamma|\le\bar r_n}\mathrm{ch}_1(V_\gamma^{-1})\} \le B\bar r_n^v$,
for some positive constants $B$ and $v$, for all large enough $\bar r_n$.
Suppose $|x_j| \le 1$ for all $j$. Suppose $K_n \succ n^\delta$ for some
$\delta > 0$ and $K_n \le e^{Cn^\xi}$ for some $C > 0$ and some
$\xi \in (0, 1)$, for all large enough $n$. Suppose
$\lim_{n\to\infty}\sum_{j=1}^{K_n}|\beta_j^*| < \infty$. Also suppose for
some $C' > 0$, $\Delta(r_n) \le e^{-C'r_n}$ for all large enough $n$, and

(16)    $(C')^{-1}\ln n \le r_n \le \bar r_n \le (\ln n)^k$

for some $k > 1$. Then we can take the convergence rate in Theorem 2 as

(17)    $\varepsilon_n \sim n^{-(1-\xi)/2}(\ln n)^{k/2}$.
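The order relations in the corollary are asymptotic, and it can help to tabulate the relevant ratios at a few sample sizes. The sketch below is a rough illustration under assumed values: it takes $K_n = n^\alpha$, $\bar r_n = (\ln n)^k$ and $\varepsilon_n^2 = n^{-(1-\xi)}(\ln n)^k$, and the particular $\alpha$, $\xi$, $k$, $v$ are arbitrary. It reports the ratios that conditions (10) and (14) require to tend to zero.

```python
import math

def corollary1_ratios(n, alpha=3.0, xi=0.2, k=2.0, v=1.0):
    """Ratios that should tend to 0 under conditions (10) and (14), assuming
    K_n = n^alpha, r_bar_n = (ln n)^k and eps_n^2 = n^{-(1-xi)} (ln n)^k."""
    log_n = math.log(n)
    r_bar = log_n ** k
    n_eps_sq = n ** xi * log_n ** k   # n * eps_n^2
    return {
        "(10)": r_bar * alpha * log_n / n_eps_sq,   # r_bar_n ln(K_n) / (n eps_n^2)
        "(14)": r_bar / n_eps_sq ** (1.0 / v),      # r_bar_n / (n eps_n^2)^(1/v)
    }

sample_sizes = (1e3, 1e6, 1e12, 1e24)
table = {n: corollary1_ratios(n) for n in sample_sizes}
```

With these settings the ratio for (10) simplifies to $\alpha\ln n / n^\xi$, which decreases only once $\ln n$ exceeds $1/\xi$; small $\xi$ therefore means the asymptotic regime sets in at very large $n$.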

Remark 1 (Good convergence rate). Note that $n^\alpha \prec e^{Cn^\xi}$ for
any small $\xi > 0$ and large $\alpha > 0$. So if $K_n \sim n^\alpha$ for
whatever large power $\alpha$, one can achieve a convergence rate
$\varepsilon_n \sim n^{-(1-\xi)/2}(\ln n)^{k/2} \prec n^{-(1-2\xi)/2}$,
where $\xi$ can be made arbitrarily close to zero. This gives a rate
arbitrarily close to the "finite-dimensional" rate $1/\sqrt{n}$, despite the
large dimension $K_n$. We note also that these results suggest slowly
growing $r_n$ and $\bar r_n$ between powers of $\ln n$, for achieving a near
finite-dimensional convergence rate. Since these are only sufficient
conditions, it may be possible that other ranges of $r_n$ and $\bar r_n$
can also lead to a near finite-dimensional rate of convergence. The following
result, for example, shows a good convergence rate even when $r_n$ and
$\bar r_n$ grow slowly in some small power of $n$.

Corollary 2. Consider the setup of Corollary 1. For any $b \in (0, q)$, if
instead of (16) we have

(18)    $(C')^{-1}\ln n \le r_n \le \bar r_n \prec n^b$

then we can take the convergence rate as

(19)    $\varepsilon_n \sim n^{-(1-\xi-b)/2}$.

Here the power $q = \min\{1 - \xi, \delta, \xi/(3 + v)\}$ for Poisson and
exponential regressions with log-linear link function;
$q = \min\{1 - \xi, \delta\}I[v \le 1] + \min\{1 - \xi, \delta, \xi/(v - 1)\}I[v > 1]$
for logistic, probit and normal linear regression.

Remark 2 (Posterior consistency). The results on posterior consistency
can be obtained as a special case by setting $\varepsilon_n = \varepsilon$
for any small but fixed $\varepsilon > 0$. There is no need to assume a rate
for $\Delta(r_n)$ for consistency results to hold, since
$\Delta(r_n) \prec \varepsilon^2$ as long as $r_n \succ 1$ and
$\lim_{n\to\infty}\sum_{j=1}^{K_n}|\beta_j^*| < \infty$. The previous
Theorem 2 then implies that the following condition on $r_n$ and $\bar r_n$
is sufficient for posterior consistency:

(20)    $1 \prec r_n \le \bar r_n \prec \min\{K_n, n^{1/(v+4)}, n/(\ln K_n)\}$.

A slightly more relaxed condition for consistency for the special cases of
logistic and probit regression can be found in Jiang [15].

Remark 3 (Normal linear regression with unknown dispersion). So far,
for normal linear regression, we have assumed that
$y|x \sim N(E(y|x), \varphi^{-1})$ with dispersion parameter (inverse
variance) $\varphi$ ($> 0$) known. In practice, $\varphi$ is unknown and a
gamma prior is often put on $\varphi$ (e.g., George and McCulloch [7], Kohn,
Smith and Chan [17] and Dobra et al. [5]). For example, suppose conditional
on model $\gamma$, $\varphi|\gamma \sim Ga(\kappa, \rho)$ with prior density
$\pi(\varphi|\gamma) = \rho^\kappa\varphi^{\kappa-1}e^{-\rho\varphi}/\Gamma(\kappa)$,
$\beta_\gamma|\gamma, \varphi \sim N(0, \varphi^{-1}V_\gamma)$ and $\gamma$
follows the prior distribution of Section 3. With this prior specification,
it can be shown that the statements regarding normal linear regression in
Theorem 2 and Corollaries 1 and 2 are still valid, where we consider bounded
covariates standardized such that $|x_j| \le 1$ for all $j$.

5. Implications of posterior convergence. It is well known that a
convergence statement such as

(21)    $\lim_{n\to\infty} P^*\{\pi[d(p,p^*) \le \varepsilon_n | D^n] > 1 - e^{-c_0 n\varepsilon_n^2}\} = 1$,

for some $c_0 > 0$, implies existence of point estimates of $p^*$ that have
the same convergence rate $\varepsilon_n$ in the frequentist sense. Such a
point estimate can be obtained by finding the center of an
$\varepsilon_n$-ball with high posterior probability, or by posterior
expectation (e.g., Ghosal, Ghosh and van der Vaart [10]).

A point estimate can also be formed by a generalization of posterior
expectation called a "selected posterior estimate" (Jiang [15]). For example,

(22)    $\hat p_A = \int p\,\pi_A(dp | D^n)$,

where $\pi_A(dp | D^n) = \pi(dp | D^n, p \in A)$, and $p \in A$ is a
selection rule, possibly data dependent. A rule of this kind, for example,
can be averaging over several of the best models, which are indexed by
$\gamma$'s having the largest marginal posteriors $\pi(\gamma | D^n)$. For
example, Smith and Kohn [23] considered the use of the best model, and Sha
et al. [22] averaged over the ten best models. A rule can also be defined by
using the models that include the individually strongest variables. For
example, include a model $\gamma$ in the posterior average if $\gamma$
includes a variable $j$ that appears more than 5 percent of time in the
posterior distribution [i.e., if $\pi(\gamma_j = 1 | D^n) > 0.05$]. See, for
example, Lee et al. [19].

Suppose a rule $A$ has selection probability
$\pi\{p \in A | D^n\} \ge r$ for some constant $r > 0$. Then the convergence
rate of $\hat p_A$ can be studied by using the relations

(23)    $d(\hat p_A, p^*)^2 \le \varepsilon_n^2 + 2\pi[d(p,p^*) > \varepsilon_n | D^n]/r$

and

(24)    $P^*[d(\hat p_A, p^*)^2 \le \varepsilon_n^2 + 2\delta_n/r] \ge P^*[\pi(d(p,p^*) > \varepsilon_n | D^n) \le \delta_n]$,

which follows a familiar treatment based on convexity of
$p \mapsto d(p,p^*)^2$ (e.g., Ghosal, Ghosh and van der Vaart [10]). The
term $\delta_n/r$ can usually be taken as $e^{-c_0 n\varepsilon_n^2}$ [see
result (i) of Theorem 1], which is negligible compared to $\varepsilon_n^2$
under conditions $\bar r_n\ln(1/\varepsilon_n^2) \prec n\varepsilon_n^2$ and
$1 \le \bar r_n$ of Theorem 1.

For regression purposes, a related mean estimate can be constructed as
$\hat\mu_A(x) = \int y\,\hat p_A(y|x)\nu_y(dy)$. When binary response $y$ is
considered, a classifier can be defined as
$\hat C_A(x) = I[\hat\mu_A(x) > 0.5]$.
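The selection rule based on individually strong variables can be sketched on a list of posterior draws. The example below is illustrative only and rests on assumptions: each draw is represented as a pair `(gamma, beta)` with `gamma` a tuple of included indices and `beta` a dictionary of coefficients, the model is binary logistic, and the draws themselves are made up rather than produced by any sampler. The mean estimate averages the per-draw logistic means over the selected draws, which is the mean of the averaged density.

```python
import math

def selected_posterior_mean(draws, x, threshold=0.05):
    """Average the logistic mean over posterior draws whose model contains at
    least one variable with marginal inclusion frequency above `threshold`."""
    n_draws = len(draws)
    freq = {}
    for gamma, _ in draws:
        for j in gamma:
            freq[j] = freq.get(j, 0) + 1
    strong = {j for j, count in freq.items() if count / n_draws > threshold}
    selected = [(g, b) for g, b in draws if strong.intersection(g)]
    if not selected:
        raise ValueError("selection rule A kept no posterior draws")
    total = 0.0
    for gamma, beta in selected:
        h = sum(beta[j] * x[j] for j in gamma)   # linear predictor x_gamma' beta_gamma
        total += 1.0 / (1.0 + math.exp(-h))      # logistic mean for this draw
    return total / len(selected)

def classify(draws, x, threshold=0.05):
    """Classifier I[mean estimate > 0.5]."""
    return 1 if selected_posterior_mean(draws, x, threshold) > 0.5 else 0

# Hypothetical posterior draws: variable 1 appears in only 1 of 40 draws.
draws = ([((0,), {0: 2.0})] * 30
         + [((0, 2), {0: 1.5, 2: -0.5})] * 9
         + [((1,), {1: -3.0})])
x_new = [1.0, 1.0, 1.0]
mu_hat = selected_posterior_mean(draws, x_new)
label = classify(draws, x_new)
```

In this toy input the single draw containing only variable 1 is excluded from the average, since that variable's inclusion frequency (1/40) is below the 5 percent threshold.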

In the general case, there is no relationship bounding the $L_2$ distance
$E_x(\hat\mu_A - \mu^*)^2$ between the estimated mean and the true mean
using $d(\hat p_A, p^*)$, because the latter is bounded but the former is
not. However, a weighted $L_2$ difference can be bounded as

(25)    $\int \frac{(\hat\mu_A - \mu^*)^2}{\hat v_A + v^*}\,\nu_x(dx) \le 2d(\hat p_A, p^*)^2$,

where $v^*(x) = \int y^2 p^*(y|x)\nu_y(dy)$ and
$\hat v_A(x) = \int y^2\hat p_A(y|x)\nu_y(dy)$. This is obtained by noting
that
$(\hat\mu_A - \mu^*)^2 = \{\int y(\sqrt{\hat p_A} + \sqrt{p^*})(\sqrt{\hat p_A} - \sqrt{p^*})\nu_y(dy)\}^2$
and applying the Cauchy-Schwarz inequality.

Since the denominator is at most 2 for binary $y$, the above relation actually
leads to a bound for the unweighted $L_2$ distance between the means, which
further leads to a bound for the classification error due to Corollary 6.2 of
Devroye, Györfi and Lugosi [4]. This is summarized below and was used in
proving regression and classification consistency in Jiang [15]:

        $E^*_{D^n}P_{(x,y)}\{\hat C_A(x) \ne y | D^n\} - L^*$

(26)    $\le E^*_{D^n}2\sqrt{E_x(\hat\mu_A - \mu^*)^2}$

(27)    $\le E^*_{D^n}2\sqrt{4d(\hat p_A, p^*)^2} \le 2\sqrt{4E^*_{D^n}d(\hat p_A, p^*)^2}$

(28)    $\le 4\sqrt{\varepsilon_n^2 + 2E^*_{D^n}\pi[d(p,p^*) > \varepsilon_n | D^n]/r}$.

[The last step is due to (23).] Here $L^* = P_{(x,y)}\{C^*(x) \ne y\}$ is the
"Bayes error," where $C^*(x) = I[\mu^*(x) > 1/2]$ is the ideal "Bayes rule"
based on the (unknown) true mean function $\mu^*$. According to Theorem
1(iii), the term $E^*_{D^n}\pi[d(p,p^*) > \varepsilon_n | D^n]$ can be made
exponentially small (of the form $e^{-c_1 n\varepsilon_n^2}$ for some
$c_1 > 0$), which is negligible when compared to $\varepsilon_n^2$ as
commented earlier. This implies that the error of the classification rule
$\hat C_A(x)$ is at most $5\varepsilon_n$ above that of the optimal Bayes
rule, for all large enough $n$. So $\varepsilon_n$ also forms a rate of
convergence to the optimal Bayes error for the purpose of classification.
Here the convergence rate $\varepsilon_n$ can be made to be near
"finite-dimensional" (nearly $1/\sqrt{n}$) by Bayesian variable selection,
despite a high dimension $K_n \sim n^\alpha \gg n$, in situations commented
on earlier (e.g., Corollary 1 and Remark 1).

These convergence rate results show that even in high dimensions with 
$\dim(x) \gg n$, a good convergence rate can be achieved when the effect of $x$ is
"sparse." For such sparse problems Bayesian variable selection can therefore 
help to alleviate "overfitting" or the "curse of dimensionality." 

6. Gaussian variable selection and graphical models. In this section we
will assume that $x^{J_n} = (x_1, \ldots, x_{K_n}, y)$ are multivariate Gaussian and have
been standardized to have $E(x_k) = 0$, $\mathrm{var}(x_k) = 1$. Here $y$ is regarded as
$x_{J_n}$, where $J_n = K_n + 1$. The effects of $x_{k \ne j}$ on $x_j$ are summarized by the
regression coefficients $\beta^*_{j|k}$ used in the induced relation
$E(x_j \mid x_{k \ne j}) = \sum_{k \ne j} \beta^*_{j|k} x_k$.

In Gaussian graphical models, relations among $x^{J_n}$ are described by a
graph, such that a node corresponding to $x_j$ is only connected to a
"neighborhood" $(x_k)_{k \in nb_j}$, where $nb_j$ is a subset of
$\{1, \ldots, J_n\} \setminus \{j\}$, which indicates selected variables used in
regression modeling of $x_j \mid x_{k \ne j}$. Therefore, the Bayesian variable
selection technique can be used for studying the neighborhood of a variable
$x_j$ (see, e.g., Dobra et al. [5]). We will consider the situation when none
of the effects of $x_{k \ne j}$ on $x_j$ is exactly zero. In this case, the usual
consistency of selecting the "true graph" (e.g., Meinshausen and Bühlmann
[20]) will not be studied here, since the true graph is the saturated graph
adopting all $K_n$ variables $x_{k \ne j}$ to explain each $x_j$. In the
high-dimensional case $K_n \gg n$, such a "true model" is obviously not very
useful. Nevertheless, in such a situation, Bayesian variable selection can
still be shown to produce "good" models that are much simpler and yet are
still "consistent," if the effects of these $K_n$ variables decay sufficiently
fast (when ordered in some way). Here "consistency" is in a different sense:
these simplified models, picking up only a small number out of all the $K_n$
nonzero regression coefficients, will be consistent in terms of producing
probability densities "often close" to the true probability density. In this
approach, one first uses Bayesian variable selection to obtain such "good"
density estimates, for all $p^*(x_j \mid x_{k \ne j})$, $j = 1, \ldots, J_n$; then one can
construct graphs to summarize the conditional independence structures
corresponding to these "good" density estimates. (One can systematically
decide to either include or exclude one-sided connections in these graphs
(see, e.g., Meinshausen and Bühlmann [20]) when some $x_k$ is used in modeling
$x_j \mid x_{k \ne j}$ but $x_j$ is not used in modeling $x_k \mid x_{j \ne k}$.)

We are interested in making inference on $J_n$ ($= K_n + 1$) (conditional)
densities $p^*(x_j \mid x_{k \ne j})$, $j = 1, \ldots, J_n$, in order to construct a graph. We
hope that the $P^*$ probability of not reliably estimating each density is small
enough so that the $P^*$ probability that any density is badly estimated is
also small. In other words, we would like to have a bound on the $P^*$
probability of large errors. For now, pick any $x_j$ as the response $y$ and
consider its regression on the $x_k$'s ($k \ne j$). To mimic the regression setup,
we can reorder the indices of the $x_k$'s. We will use the prior specified in
Remark 3.

A result as in Theorem 1(ii), obtained when assuming uniformly bounded
$|x_k|$'s, could be used for this purpose of bounding the total error out of the
$J_n$ regression analyses.

In the current situation of Gaussian graphical models, however, the
$x_k$'s are Gaussian instead of being uniformly bounded. In this case, for
result (ii) of Theorem 1 to hold, we will change the condition on
$\Delta(r_n) = \inf_{\gamma: |\gamma| = r_n} \sum_{k \notin \gamma} |\beta_k^*|$ from $\Delta(r_n) \prec \epsilon_n^2$ to
$K_n \Delta(r_n) \prec \epsilon_n^2$. This would be satisfied if $\Delta(r_n)$ decays exponentially
fast in $r_n$, $r_n \succ \ln n$, and $K_n$ grows at most polynomially. After taking into
account some other conditions, we obtain the following theorem.
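To see why exponential decay of $\Delta(\cdot)$ with $r_n \succ \ln n$ is enough when $K_n$ is polynomial, the following sketch tracks $\ln\{K_n \Delta(r_n)/\epsilon_n^2\}$ as $n$ grows. All rates here are assumptions made for illustration: $K_n = n^{\alpha}$, $\Delta(r) = e^{-Cr}$, $r_n = (\ln n)^2$ and $\epsilon_n^2 = n^{-(1-\xi)}$, with hypothetical constants.

```python
import math

def log_ratio(n: float, alpha: float = 3.0, xi: float = 0.1, C: float = 1.0) -> float:
    """Return ln of K_n * Delta(r_n) / eps_n^2 under the assumed rates."""
    log_n = math.log(n)
    r_n = log_n ** 2                   # model size growing faster than ln n
    log_K = alpha * log_n              # K_n = n^alpha (polynomial growth)
    log_delta = -C * r_n               # Delta(r_n) = exp(-C * r_n)
    log_eps_sq = -(1.0 - xi) * log_n   # eps_n^2 = n^{-(1 - xi)}
    return log_K + log_delta - log_eps_sq

for n in (10 ** 2, 10 ** 3, 10 ** 6):
    print(n, log_ratio(n))
```

The log-ratio tends to minus infinity because the $-C(\ln n)^2$ term eventually dominates every multiple of $\ln n$.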

Theorem 3. Consider the prior specification in Remark 3. (When selecting
the neighborhood for each $x_j$, treat $x_j$ as $y$ and $x_{k \ne j}$ as $x^{K_n}$.) Assume
that

$\max\{\sup_{\gamma: |\gamma| \le \bar r_n} \mathrm{ch}_1(V_\gamma),\ \sup_{\gamma: |\gamma| \le \bar r_n} \mathrm{ch}_1(V_\gamma^{-1})\} \le B \bar r_n^{v}$

for some positive constants $B$ and $v$, for all large enough $\bar r_n$.

Suppose that, for each $x_j$, the effects of the other variables $x_{k \ne j}$
satisfy $\lim_{n \to \infty} \sum_{k \in \mathcal{K}_j} |\beta^*_{j|k}| < \infty$, where
$\mathcal{K}_j = \{1, \ldots, K_n + 1\} \setminus \{j\}$. In addition, assume that there exists some
$C > 0$, such that, for all large enough $n$,
$\inf_{\gamma \subset \mathcal{K}_j, |\gamma| = r_n} \sum_{k \in \mathcal{K}_j \setminus \gamma} |\beta^*_{j|k}| \le e^{-C r_n}$.

Assume that $n^{\delta} \prec K_n \prec n^{\alpha}$ for some $\alpha > \delta > 0$.
Assume also for some $\xi \in (0, 1)$

(29)    $\ln n \prec r_n \le \bar r_n \prec n^{b}$, where $b < \min\{\delta, \xi, \xi/v\}$.

Then we have, for some constant $c_1 > 0$, for all sufficiently large $n$,

(i)
$P^*[\pi(h_j \le n^{-(1-\xi)/2} \mid D^n) \ge 1 - e^{-c_1 n^{\xi}},\ j = 1, \ldots, K_n + 1] \ge 1 - n^{\alpha} e^{-c_1 n^{\xi}}$

and,

(ii)
$P^*[h_{j,A_j} \le 4n^{-(1-\xi)/2},\ j = 1, \ldots, K_n + 1] \ge 1 - n^{\alpha} e^{-c_1 n^{\xi}}.$

Here we define, for $j = 1, \ldots, K_n + 1$,

(30)    $h_j = \{\int |p(x_j|x_{k \ne j})^{1/2} - p^*(x_j|x_{k \ne j})^{1/2}|^2\, p^*(x_{k \ne j})\, dx^{K_n+1}\}^{1/2},$

(31)    $h_{j,A_j} = \{\int |p_{A_j}(x_j|x_{k \ne j})^{1/2} - p^*(x_j|x_{k \ne j})^{1/2}|^2\, p^*(x_{k \ne j})\, dx^{K_n+1}\}^{1/2},$

where $p^*$ represents the true density and $p_{A_j}$ is a selected posterior
estimate [as defined in (22)] corresponding to a selection rule $A_j$, such that
the selection probability $\pi(p \in A_j \mid D^n) \ge r$ for some $r > 0$.

Therefore, a near finite-dimensional rate of convergence can be achieved
(for some small $\xi > 0$), jointly for all neighborhoods of $x_j$, $j = 1, \ldots, K_n + 1$,
despite the fact that $K_n$ can follow a large power of $n$.
7. General prior. In this section we consider the case $|x_j| \le 1$ for all $j$
and mainly focus on the GLM models as described in Section 2, where $a(h)$
and $b(h)$ contain no additional parameters other than $h$. (Similar conditions
and results can be formulated for normal linear regression with unknown
error variance.)

Here we consider the general conditions on the prior $\pi(\gamma, \beta_\gamma)$ for producing
a rate of convergence $\epsilon_n$, which is a sequence in $n$ that we assume to satisfy
$\epsilon_n \in (0, 1]$ for Conditions (N) and (O) below.

Condition (N) requires that not too little prior mass be placed over a very
small neighborhood of the true density $p^*$. Condition (O) requires that very
little prior mass be placed outside of a region that is not too complex in some
sense.



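Condition (N) (For prior $\pi$ on an approximation neighborhood). Assume
that a sequence of (nonempty) models $\gamma_n$ exists such that, as $n$ increases,

(32)    $\sum_{j \notin \gamma_n} |\beta_j^*| \prec \epsilon_n^2,$

and for any sufficiently small $\eta > 0$, there exists $N_\eta$ such that, for all
$n > N_\eta$, we have

(33)    $\pi(\gamma = \gamma_n) \ge e^{-n\epsilon_n^2/8}$

and

(34)    $\pi(\beta_\gamma \in M(\gamma_n, \eta) \mid \gamma = \gamma_n) \ge e^{-n\epsilon_n^2/8},$

where $M(\gamma_n, \eta) = (\beta_j^* \pm \eta\epsilon_n^2/|\gamma_n|)_{j \in \gamma_n}$.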

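Condition (O) (For prior $\pi$ outside of a not-too-complex region). Let
$D(R) = 1 + R \cdot \sup_{|h| \le R} |a'(h)| \cdot \sup_{|h| \le R} |\psi(h)|$ for any $R > 0$. There exist
some $C_n > 0$ and some $\bar r_n$ satisfying $1 \le \bar r_n < K_n$, such that

(35)    $\bar r_n \ln(1/\epsilon_n^2) \prec n\epsilon_n^2,$

(36)    $\bar r_n \ln K_n \prec n\epsilon_n^2,$

(37)    $\bar r_n \ln D(\bar r_n C_n) \prec n\epsilon_n^2.$

Furthermore, for all large enough $n$, the following two inequalities hold:

(38)    $\pi(|\gamma| > \bar r_n) \le e^{-4n\epsilon_n^2},$

and for all $\gamma$ such that $|\gamma| \le \bar r_n$, for all $j \in \gamma$,

(39)    $\pi(|\beta_j| > C_n \mid \gamma) \le e^{-4n\epsilon_n^2}.$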
These conditions allow a larger variety of priors. For example, one can use
a uniform prior of $\gamma$ over all models with complexity $|\gamma| \le \bar r_n$, where $\bar r_n$ can
be taken to follow some rate of growth depending on the convergence rate
$\epsilon_n$ desired, and depending on the "bias" rate
$\Delta(r) = \inf_{\gamma: |\gamma| = r} \sum_{j \notin \gamma} |\beta_j^*|$.

Before, we truncated $\pi(\gamma)$ such that $\pi[|\gamma| > \bar r_n] = 0$. This may not be
desirable since we are forbidding the model to be too complex in the prior.
We here notice that this truncation is not necessary. We can allow the prior
to propose very complicated models with large $|\gamma|$, as long as the prior
probability of $|\gamma| > \bar r_n$ is sufficiently small.
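As an illustration of how small the prior mass on overly complex models can be without any truncation, the sketch below computes the exact tail probability of the model size when the inclusion indicators are i.i.d. binary, and compares it with a threshold of the form $e^{-4n\epsilon_n^2}$ as in (38). The untruncated i.i.d. binary prior, the numerical values of $K$, $r_n$, $\bar r_n$, $n$ and the rate $\epsilon_n^2 = n^{-0.8}$ are all assumptions made for this example.

```python
import math

def log_binom_tail(K: int, p: float, r_bar: int) -> float:
    """Return ln P(X > r_bar) for X ~ Binomial(K, p), computed in log space."""
    logs = [
        math.lgamma(K + 1) - math.lgamma(k + 1) - math.lgamma(K - k + 1)
        + k * math.log(p) + (K - k) * math.log1p(-p)
        for k in range(r_bar + 1, K + 1)
    ]
    top = max(logs)  # log-sum-exp trick to avoid underflow
    return top + math.log(sum(math.exp(v - top) for v in logs))

K, r_n, r_bar, n = 1000, 5, 60, 200   # hypothetical sizes
eps_sq = n ** -0.8                    # hypothetical eps_n^2
log_tail = log_binom_tail(K, r_n / K, r_bar)
log_threshold = -4.0 * n * eps_sq     # ln of exp(-4 n eps_n^2)
print("ln tail:", log_tail, " ln threshold:", log_threshold)
```

When the expected model size $r_n$ is much smaller than $\bar r_n$, the binomial tail is far below the threshold, so the tail requirement can hold even though arbitrarily large models receive positive prior probability.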

Theorem 4 (Convergence rate under general prior). For GLM models with
bounded covariates $|x_j| \le 1$ for all $j$, suppose the true regression
coefficients satisfy $\lim_{n \to \infty} \sum_{j=1}^{K_n} |\beta_j^*| < \infty$.

Let $\epsilon_n \in (0, 1]$ be a sequence such that $n\epsilon_n^2 \to \infty$. Denote
$d(p, p^*)^2 = \int\int |p(y, x|\gamma, \beta_\gamma)^{1/2} - p^*(y, x)^{1/2}|^2\, \nu_y(dy)\nu_x(dx)$.
If the prior specification satisfies both Conditions (N) and (O), then we have
the following (successively stronger) results:

(i)
$\lim_{n \to \infty} P^*\{\pi[d(p, p^*) \le 4\epsilon_n \mid D^n] \ge 1 - 2e^{-n\epsilon_n^2/4}\} = 1.$

(ii) For all sufficiently large $n$,
$P^*\{\pi[d(p, p^*) > 4\epsilon_n \mid D^n] \ge 2e^{-n\epsilon_n^2/4}\} \le 2e^{-n\epsilon_n^2/4}.$

(iii) For all sufficiently large $n$,
$E^*_{D^n} \pi[d(p, p^*) > 4\epsilon_n \mid D^n] \le 4e^{-n\epsilon_n^2/2}.$

Results (i), (ii) and (iii) of this theorem will be proved by verifying some
sufficient conditions for posterior convergence (to be summarized at the
beginning of Section 8). These results, with bounded $x_j$'s, will then be used
to prove all the previous results on convergence rates, when specific priors
as given in Section 3 are used. The only exception is the result in Section 6,
where the $x_j$'s are jointly normal; it will be obtained by directly verifying
the sufficient conditions in Section 8.

These conditions below are based on the Hellinger metric entropy and will
be used to obtain posterior convergence rates under the GLM framework.
Note that the method involved here is different from that in Jiang [15], who
uses the Hellinger bracketing entropy and its upper bound of a parametric
covering number (see, e.g., Theorem 3, Lee [18]). That method does not
directly apply to modeling unbounded responses such as Gaussian and Poisson
responses. (When applied to, e.g., Poisson regression, the upper bound of
the bracketing entropy would require a too small restricted parameter space,
on which the prior would place a nonnegligible probability.)
8. Proofs. We first use a proposition to summarize a set of sufficient
conditions for establishing rates of posterior convergence. These serve as
just one possible set of working conditions that we find convenient to use
here, through which we have established our results; there exist several other
alternatives, possibly with more relaxed conditions, for example, in Ghosal,
Ghosh and van der Vaart [10] or Zhang [26].

Suppose $\mathcal{P}_n$ is a sequence of sets of probability densities. (For each $n$,
denote $\mathcal{P}_n^c$ as the complement, that is, the set of densities not in $\mathcal{P}_n$.)
Suppose $\epsilon_n$ is a sequence of positive numbers.

Suppose $N(\epsilon_n, \mathcal{P}_n)$ is the minimal number of Hellinger balls of radius $\epsilon_n$
that are needed to cover $\mathcal{P}_n$. [I.e., $N(\epsilon_n, \mathcal{P}_n)$ is the minimum of all $k$ such
that there exist $S_j = \{p : d(p, p_j) \le \epsilon_n\}$, $j = 1, \ldots, k$, such that
$\bigcup_{j=1}^{k} S_j \supseteq \mathcal{P}_n$, where $d(p, q) = \sqrt{\int (\sqrt{p} - \sqrt{q})^2}$ denotes the Hellinger
distance between densities $p$ and $q$.]

Let the components of $D^n = (w^{(1)}, \ldots, w^{(n)})$ be i.i.d. with true density $p^*$,
where $\dim(w^{(1)})$ and $p^*$ can depend on $n$. Denote $\pi(\cdot)$ as the prior (which is
allowed to depend on $n$ by using, e.g., an increasing number of parameters to
parameterize the density as $n$ increases), $\pi(\cdot \mid D^n)$ as the posterior and
$\pi(\epsilon) = \pi[d(p, p^*) > \epsilon \mid D^n]$ for each $\epsilon > 0$. Define the KL difference as
$d_0(p, p^*) = \int p^* \ln(p^*/p)$. Define also a $d_t$ difference as
$d_t(p, p^*) = t^{-1}(\int p^*(p^*/p)^t - 1)$ for any $t > 0$, which is used in, for
example, Wong and Shen [25]. (Note that $d_t$ decreases to $d_0$ as $t$ decreases
toward 0.)
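The three discrepancies just defined are easy to compare on a small discrete example. The sketch below is illustrative only; the two probability vectors and the function names are our own, and sums over a finite support stand in for the integrals.

```python
import math

def kl_difference(p_star: list[float], p: list[float]) -> float:
    """d_0(p, p*) = sum of p* ln(p*/p)."""
    return sum(ps * math.log(ps / q) for ps, q in zip(p_star, p) if ps > 0)

def dt_difference(p_star: list[float], p: list[float], t: float) -> float:
    """d_t(p, p*) = (sum of p* (p*/p)^t - 1) / t, for t > 0."""
    return (sum(ps * (ps / q) ** t for ps, q in zip(p_star, p)) - 1.0) / t

def hellinger(p: list[float], q: list[float]) -> float:
    """d(p, q) = sqrt(sum of (sqrt(p) - sqrt(q))^2)."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

p_star = [0.5, 0.3, 0.2]   # hypothetical true density
p = [0.4, 0.4, 0.2]        # hypothetical fitted density
for t in (1.0, 0.5, 0.01):
    print("t =", t, " d_t =", dt_difference(p_star, p, t))
print("d_0 =", kl_difference(p_star, p))
print("d^2 =", hellinger(p, p_star) ** 2)
```

On this example $d_t$ shrinks toward $d_0$ as $t$ decreases, and the squared Hellinger distance stays below $d_0$, which is the relation $d \le d_0^{1/2}$ used later when checking condition (a).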

Denote $P^*$ and $E^*$ as the respective probability measure and the
expectation for the data $D^n$.

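Define the following conditions:

(a) $\ln N(\epsilon_n, \mathcal{P}_n) \le n\epsilon_n^2$ for all sufficiently large $n$;

(b) $\pi(\mathcal{P}_n^c) \le e^{-2n\epsilon_n^2}$ for all sufficiently large $n$;

(c) for all small enough $\gamma > 0$ and $r > 0$, there exists $N_{\gamma,r}$ such that for
all $n > N_{\gamma,r}$, $\pi[p : d_0(p, p^*) \le \gamma\epsilon_n^2] \ge e^{-nr\epsilon_n^2}$;

(d) $\pi[p : d_t(p, p^*) \le \epsilon_n^2/4] \ge e^{-n\epsilon_n^2/4}$ for all sufficiently large $n$, for
some $t > 0$.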

Proposition 1. Suppose $n\epsilon_n^2 \succ 1$. Then, under (a), (b) and (c), we
have:

(i)
$\lim_{n \to \infty} P^*[\pi(4\epsilon_n) \le 2e^{-n\epsilon_n^2/2}] = 1;$

under (a), (b) and (d) (for some $t > 0$), we have

(ii)
$P^*[\pi(4\epsilon_n) \ge 2e^{-n\epsilon_n^2 \min\{1,t\}/4}] \le 2e^{-n\epsilon_n^2 \min\{1,t\}/4},$
(iii)

The proof of this proposition follows the spirit of Ghosal, Ghosh and van 
der Vaart [10]. The details are omitted here and are included in a technical 
report (Jiang [14]). 

Proof of Theorem 4. We prove result (iii) only, since it implies (ii)
by Markov's inequality, which further implies (i). Result (iii) is proven by
applying Proposition 1 with $t = 1$. The proof is completed by checking
conditions (d), (a) and (b) below.
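The Markov step can be spelled out as follows. This is a sketch in our own shorthand: $\Pi_n$ abbreviates $\pi[d(p,p^*) > 4\epsilon_n \mid D^n]$, and $a_n$ is any positive sequence with $E^*\Pi_n \le a_n^2$ and $a_n \to 0$. For instance, if the bound in (iii) is read as $4e^{-n\epsilon_n^2/2}$, then $a_n = 2e^{-n\epsilon_n^2/4}$.

```latex
\begin{align*}
P^*\{\Pi_n \ge a_n\}
  &\le \frac{E^*\Pi_n}{a_n} \le \frac{a_n^2}{a_n} = a_n, \\
P^*\{1 - \Pi_n \ge 1 - a_n\}
  &\ge 1 - P^*\{\Pi_n \ge a_n\} \ge 1 - a_n \to 1 .
\end{align*}
```

The first line is the form of result (ii), and the second line, in which $1 - \Pi_n$ is the posterior probability of the Hellinger ball of radius $4\epsilon_n$, is the form of result (i).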

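Checking condition (d) for $t = 1$. Denote the GLM density as
$f(y, h) = \exp\{a(h)y + b(h) + c(y)\}$. Then $p^* = f(y, h^*)$, where
$h^* = x^T\beta^* = \sum_{j=1}^{K_n} x_j \beta_j^*$. Let $p_\gamma = f(y, h_\gamma)$, where
$h_\gamma = x_\gamma^T \beta_\gamma = \sum_{j \in \gamma_n} x_j \beta_j$, where $\gamma_n$ is the model in Condition (N).

When $h^*$ and $h_\gamma$ are close enough, $d_t(p_\gamma, p^*)$ (for $t = 1$) can be put in
a form $d_t(p_\gamma, p^*) = E_x g(\tilde h)(h^* - h_\gamma)$, by integrating out $y$ and applying a
first-order Taylor expansion. Here $g$ is a continuous derivative function in a
neighborhood of $h^*$ and $\tilde h$ is an intermediate point between $h^*$ and $h_\gamma$. Note
that
$|\tilde h - h^*| \le |h_\gamma - h^*| \le |\sum_{j \notin \gamma_n} x_j \beta_j^*| + |\sum_{j \in \gamma_n} x_j(\beta_j - \beta_j^*)| \le \Delta_n + r_n\delta_n$,
when the $x_j$'s are bounded by 1 and $\beta_j \in (\beta_j^* \pm \delta_n)$ for all $j \in \gamma_n$. Here
$r_n = |\gamma_n|$ and $\Delta_n = \sum_{j \notin \gamma_n} |\beta_j^*|$ is assumed to satisfy $\Delta_n \prec \epsilon_n^2$.

For sufficiently small $r_n\delta_n$, $|g(\tilde h)|$ is bounded since
$|\tilde h| \le |h^*| + |\tilde h - h^*| \le B_0 + \Delta_n + r_n\delta_n$ is bounded, where
$B_0 = \lim_{n \to \infty} \sum_{j=1}^{K_n} |\beta_j^*|$. Then $d_t(p_\gamma, p^*) \le C(\Delta_n + r_n\delta_n)$ for some
constant $C$, for all small enough $r_n\delta_n$.

We will take $\delta_n = \eta\epsilon_n^2/|\gamma_n|$ for some small enough $\eta > 0$. This will make
$d_t \le \epsilon_n^2/4$ for all large enough $n$, since $\Delta_n \prec \epsilon_n^2$.

This implies that the set of densities
$S = \{p(\cdot \mid \gamma_n, \beta) : \beta \in (\beta_j^* \pm \delta_n)_{j \in \gamma_n}\}$ is contained in
$T = \{p : d_t(p, p^*) \le \epsilon_n^2/4\}$. The conditions on $\pi(\gamma_n)$ and
$\pi(\beta \in (\beta_j^* \pm \delta_n)_{j \in \gamma_n} \mid \gamma_n)$ then imply that $\pi(T) \ge \pi(S) \ge e^{-n\epsilon_n^2/4}$ for
all large $n$, confirming condition (d).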

Checking condition (a). Each density $p$ is labeled by a model index $\gamma$ and
the corresponding regression coefficients $\beta_\gamma$. We will define $\mathcal{P}_n$ as the set of
densities that can be represented with $|\gamma|$ (the number of nonzero regression
parameters) being at most $\bar r_n$, and with each parameter $|\beta_j| \le C_n$.

The corresponding space of regression parameters can be covered by small
$\ell_\infty$ balls of the form $B = (v_j \pm \delta)_{j=1}^{K_n}$, of radius $\delta > 0$. For each model $\gamma$
in $\mathcal{P}_n$, there are $|\gamma|$ nonzero components of $\beta_j$, valued in $\pm C_n$. It takes at
most $[(2C_n)/(2\delta) + 1]^{|\gamma|}$ balls to cover the parameter space of model $\gamma$ in
$\mathcal{P}_n$. [The centers of these balls can be taken inside the parameter space of
model $\gamma$, so that each center $v = (v_j)_1^{K_n}$ has components satisfying $v_j = 0$
$\forall j \notin \gamma$ and $|v_j| \le C_n$ $\forall j \in \gamma$.]

There are at most $K_n^r$ models of size $|\gamma| = r$, and $r = 0, 1, 2, \ldots, \bar r_n$. These
show that $N(\delta)$, the number of size-$\delta$ balls needed to cover the space of
regression parameters for $\mathcal{P}_n$, is at most
$\sum_{r=0}^{\bar r_n} K_n^r [(2C_n)/(2\delta) + 1]^r$, which is bounded above by
$(\bar r_n + 1)(K_n(C_n/\delta + 1))^{\bar r_n}$.

Given any density in $\mathcal{P}_n$, it can be represented by a set of regression
parameters $(u_j)_1^{K_n}$ falling in one of these $N(\delta)$ balls, say, ball
$B = (v_j \pm \delta)_{j=1}^{K_n}$, where $u_j$ and $v_j$ are zero outside a common set of
components $\gamma$, where $|\gamma| \le \bar r_n$.

Consider the corresponding GLM densities
$p_{u,v} = \exp\{y a(h_{u,v}) + b(h_{u,v}) + c(y)\}$, where $h_u = \sum_{j=1}^{K_n} x_j u_j$ and
$h_v = \sum_{j=1}^{K_n} x_j v_j$. Then the Hellinger distance
$d(p_u, p_v) \le \{d_0(p_u, p_v)\}^{1/2}$, where the KL difference
$d_0(p_u, p_v) = E_x \int p_v(\ln p_v - \ln p_u)\, \nu_y(dy)$. After integration in $y$, one
can apply a Taylor expansion and show that
$d_0(p_u, p_v) = E_x(a'(\tilde h)\psi(h_v) + b'(\tilde h))(h_v - h_u)$, where $\psi = -b'/a'$ and
$\tilde h$ is an intermediate point between $h_u$ and $h_v$. Note that $u$ and $v$ both
have components bounded in value by $C_n$ and they have zero components outside
a common set, say, $\gamma$, such that $|\gamma| \le \bar r_n$. Therefore, $h_v$ and $h_u$ (and
therefore, also $\tilde h$) are bounded in absolute value by $\bar r_n C_n$. Note also that
$|h_v - h_u| = |\sum_{j \in \gamma} x_j(v_j - u_j)| \le \bar r_n \delta$, since $|x_j| \le 1$, $|v_j - u_j| \le \delta$ and
$|\gamma| \le \bar r_n$. Therefore,

(40)    $d_0(p_u, p_v) \le 2 \sup_{|h| \le \bar r_n C_n} |a'(h)| \sup_{|h| \le \bar r_n C_n} |\psi(h)|\, \bar r_n \delta$

and

(41)    $d(p_u, p_v) \le \{2 \sup_{|h| \le \bar r_n C_n} |a'(h)| \sup_{|h| \le \bar r_n C_n} |\psi(h)|\, \bar r_n \delta\}^{1/2}.$

So $d(p_u, p_v) \le \epsilon_n$ if
$\delta = \epsilon_n^2/\{2 \sup_{|h| \le \bar r_n C_n} |a'(h)| \sup_{|h| \le \bar r_n C_n} |\psi(h)|\, \bar r_n\}$. Therefore,
density $p_u$ in $\mathcal{P}_n$ falls in a Hellinger ball of size $\epsilon_n$, centered at $p_v$.
There are at most $N(\delta)$ such balls, because each center $p_v$ is the density
corresponding to the parameter $v$, which is the center of $B$, one of the at
most $N(\delta)$ balls used to cover the restricted parameter space.
Therefore, the Hellinger covering number

(42)    $N(\epsilon_n, \mathcal{P}_n) \le N(\delta)$
        $\le (\bar r_n + 1) K_n^{\bar r_n} (1 + 2\epsilon_n^{-2} \sup_{|h| \le \bar r_n C_n} |a'(h)| \sup_{|h| \le \bar r_n C_n} |\psi(h)|\, \bar r_n C_n)^{\bar r_n}$
        $\le (2K_n^2 D(\bar r_n C_n)/\epsilon_n^2)^{\bar r_n},$

if $\epsilon_n \le 1$ and $1 \le \bar r_n < K_n$. Therefore, the conditions in Condition (O)
guarantee that $\ln N(\epsilon_n, \mathcal{P}_n) \prec n\epsilon_n^2$ for all large enough $n$, proving condition
(a).
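The last two bounds in the covering-number chain can be compared numerically. The sketch below is an illustration under assumed inputs: the values of $\epsilon_n$, $K_n$, $\bar r_n$, $C_n$ are hypothetical, and the suprema of $|a'|$ and $|\psi|$ are both set to 1, which is an upper bound suitable for, e.g., logistic regression with $a(h) = h$.

```python
import math

def log_covering_bounds(eps: float, K: int, r_bar: int, C: float,
                        sup_a_prime: float, sup_psi: float) -> tuple[float, float]:
    """Return (ln of the middle bound, ln of the final bound) in (42)."""
    A = sup_a_prime * sup_psi
    # Parameter-ball radius that makes each Hellinger ball have radius eps.
    delta = eps ** 2 / (2.0 * A * r_bar)
    # ln of (r_bar + 1) * (K * (C / delta + 1))^r_bar
    log_middle = math.log(r_bar + 1) + r_bar * (math.log(K) + math.log1p(C / delta))
    D = 1.0 + r_bar * C * A   # D(R) = 1 + R * sup|a'| * sup|psi| at R = r_bar * C
    # ln of (2 K^2 D / eps^2)^r_bar
    log_final = r_bar * math.log(2.0 * K ** 2 * D / eps ** 2)
    return log_middle, log_final

log_mid, log_fin = log_covering_bounds(
    eps=0.1, K=1000, r_bar=10, C=5.0, sup_a_prime=1.0, sup_psi=1.0
)
print("ln middle bound:", log_mid, " ln final bound:", log_fin)
```

The final bound is looser but depends on the model only through $\bar r_n \ln K_n$, $\bar r_n \ln D(\bar r_n C_n)$ and $\bar r_n \ln(1/\epsilon_n^2)$, which are exactly the quantities controlled by Condition (O).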
Checking condition (b). For the $\mathcal{P}_n$ defined above, the prior on the
complement
$\pi(\mathcal{P}_n^c) \le \pi[|\gamma| > \bar r_n] + \sum_{\gamma: |\gamma| \le \bar r_n} \pi[\gamma]\, \pi(\bigcup_{j \in \gamma} [|\beta_j| > C_n] \mid \gamma)$,
which is at most
$\pi[|\gamma| > \bar r_n] + \max_{\gamma: |\gamma| \le \bar r_n} \pi(\bigcup_{j \in \gamma} [|\beta_j| > C_n] \mid \gamma)$. This is, due to
Condition (O), at most
$(1 + \bar r_n)e^{-4n\epsilon_n^2} = e^{\ln(1 + \bar r_n) - 4n\epsilon_n^2} \le \exp(-4n\epsilon_n^2/2)$ for
all large enough $n$. Here we have used $1 \le \bar r_n < K_n$, so that
$\ln(1 + \bar r_n) \le \bar r_n \ln K_n \prec n\epsilon_n^2$, due to Condition (O). This proves condition (b). □

Proof of Theorem 1. We apply Theorem 4 with $\epsilon_n$ replaced by $\epsilon_n'$,
so that the Hellinger neighborhood will take a radius $4\epsilon_n'$. This can be later
rescaled to obtain the results in Theorem 1 concerning a radius $\epsilon_n$, by setting
$\epsilon_n = 4\epsilon_n'$ or $\epsilon_n' = \epsilon_n/4$.

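For Condition (O): with the prior in Section 3, the condition on $\pi[|\gamma| > \bar r_n]$
is trivially satisfied, since it is zero due to truncation. We will take
$C_n = \sqrt{B(\bar r_n)\, n\epsilon_n^2}$ so that the condition on $\bar r_n \ln D(\bar r_n C_n)$ is satisfied.
The condition on $\pi[|\beta_j| > C_n \mid \gamma]$ is checked by using Mills' ratio. It is at
most $2e^{-C_n^2/(2B(\bar r_n))} / \sqrt{2\pi C_n^2/B(\bar r_n)}$, therefore less than
$e^{-n\epsilon_n^2/4} = e^{-4n(\epsilon_n')^2}$ for large enough $n$, as required by Condition (O).
Here $B(\bar r_n)$ is an upper bound on the prior variance of $\beta_j$ under model $\gamma$ with
$|\gamma| \le \bar r_n$, and $n\epsilon_n^2 \succ 1$. All other conditions in Condition (O) are satisfied.

For Condition (N): Take the sequence of models $\gamma_n$ such that, for each $n$,
$\gamma = \gamma_n$ reaches the infimum in $\Delta(r_n) = \inf_{\gamma: |\gamma| = r_n} \sum_{j: j \notin \gamma} |\beta_j^*|$. Then
$\sum_{j \notin \gamma_n} |\beta_j^*| = \Delta(r_n) \prec \epsilon_n^2$.

For the condition on the prior $\pi[\beta \in (\beta_j^* \pm \eta\epsilon_n^2/r_n)_{j \in \gamma_n} \mid \gamma_n]$, use the
normality of the prior and obtain the lower bound
$|2\pi V_{\gamma_n}|^{-1/2} e^{-0.5\tilde\beta^T V_{\gamma_n}^{-1} \tilde\beta} (\eta\epsilon_n^2/r_n)^{r_n}$
for some intermediate value $\tilde\beta$ achieving the infimum of the density over
$(\beta_j^* \pm \eta\epsilon_n^2/r_n)_{j \in \gamma_n}$.

Note that
$\tilde\beta^T V_{\gamma_n}^{-1} \tilde\beta \le \|\tilde\beta\|^2 B(r_n) \le (\sum_{j \in \gamma_n} |\tilde\beta_j|)^2 B(r_n) \le C_1 B(r_n)$
for some constant $C_1 > 0$, since the eigenvalues of $V_{\gamma_n}^{-1}$ are at most $B(r_n)$
(for all large enough $n$), and the Euclidean norm
$\|\tilde\beta\| \le \sum_{j \in \gamma_n} |\tilde\beta_j| \le \lim_{n \to \infty} \sum_{j=1}^{K_n} |\beta_j^*| + r_n \eta\epsilon_n^2/r_n$ is bounded.
Note also that $|2\pi V_{\gamma_n}|^{-1/2} \ge e^{-C_2 r_n - C_3 r_n \ln B(r_n)}$ for some constant $C_2$
and some constant $C_3 > 0$, due to the eigenvalues of $V_{\gamma_n}$ being bounded above
by $B(r_n)$ (for all large enough $n$).

Therefore,

(43)    $\pi[\beta \in (\beta_j^* \pm \eta\epsilon_n^2/r_n)_{j \in \gamma_n} \mid \gamma_n]$
        $\ge \exp\{-C_2 r_n - C_3 r_n \ln B(r_n) - 0.5 C_1 B(r_n) - r_n \ln(r_n/(\eta\epsilon_n^2))\}.$

This will be greater in order than any $e^{-cn\epsilon_n^2}$ ($c > 0$), satisfying a
requirement of Condition (N), since $r_n$, $r_n \ln B(r_n)$ and $B(r_n)$ are all smaller
than $n\epsilon_n^2$ in order, and so are $r_n \ln r_n \le \bar r_n \ln K_n$ and
$r_n \ln(1/\epsilon_n^2) \le \bar r_n \ln(1/\epsilon_n^2)$.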
Now consider the condition on $\pi(\gamma_n)$. Note that the $\gamma_n$ chosen is such that
$|\gamma_n| = r_n$, where $r_n$ ($\le \bar r_n$) is the expected size of the model $\gamma = \gamma^{K_n}$
proposed by the prior before truncation. The prior specification of $\gamma^{K_n}$ (in
Section 3) is i.i.d. binary with $\pi(\gamma_j = 1) = r_n/K_n$. For the condition on
$\pi(\gamma = \gamma_n)$ to hold, it suffices for us to show that (*) for any $c > 0$, the
prior before truncation satisfies $\pi(\gamma = \gamma_n) \ge e^{-cn\epsilon_n^2}$ for all large enough
$n$. This is because the truncated prior probability of $\gamma = \gamma_n$ cannot be
smaller: truncation increases the probability of all allowed configurations
(note that $|\gamma_n| \le \bar r_n$).

Now $|\gamma_n| = r_n$ implies that there are $r_n$ out of $K_n$ $\gamma_j$'s equal to 1,
with the rest being 0. The probability before truncation is therefore
$\pi(\gamma = \gamma_n) = (r_n/K_n)^{r_n}(1 - r_n/K_n)^{K_n - r_n}$. Since $r_n/K_n \prec 1$, we have
$\ln \pi(\gamma = \gamma_n) \sim r_n \ln(r_n/K_n) \ge -r_n \ln K_n$ ($r_n \ge 1$), where
$r_n \ln K_n \prec n\epsilon_n^2$. This leads to claim (*). □
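The size of this prior probability is easy to evaluate directly. The sketch below computes the logarithm of $(r/K)^r(1 - r/K)^{K-r}$ and compares it with $-r \ln K$; the values $r = 10$ and $K = 1000$ and the function name are hypothetical, chosen only to illustrate the order of magnitude.

```python
import math

def log_prior_model(r: int, K: int) -> float:
    """ln of (r/K)^r * (1 - r/K)^(K - r): prior probability of one fixed model
    of size r under i.i.d. binary inclusion with probability r/K."""
    p = r / K
    return r * math.log(p) + (K - r) * math.log1p(-p)

r, K = 10, 1000                      # hypothetical model size and dimension
log_pi = log_prior_model(r, K)
lower = -r * math.log(K)             # the lower bound -r ln K used in the text
print("ln prior:", log_pi, " -r ln K:", lower)
```

The second factor contributes roughly $-r$ to the logarithm, which is of smaller order than $r \ln(K/r)$ when $r/K$ is small; this is why the text treats $\ln \pi(\gamma = \gamma_n)$ as asymptotically $r_n \ln(r_n/K_n)$.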

Proof of Theorem 3. It suffices for us to prove (**) result (ii) of
Proposition 1 in a regression setup for normal dispersion models with $K_n$
Gaussian covariates. We will take $\epsilon_n \sim n^{-(1-\xi)/2}$ for some $\xi \in (0, 1)$. Then
result (i) of Theorem 3 can be obtained by a union bound over $K_n + 1$
regressions, treating each of the $K_n + 1$ $x_j$'s in turn as the response $y$.
Result (ii) of Theorem 3 can be obtained by using bounds of the form (24).

We prove (**) by directly applying Proposition 1 and verifying conditions
(a), (b) and (d) (for $t = 1$). The details are omitted here and are included
in a technical report (Jiang [14]) in order to save space. □

9. Discussion. Bayesian variable selection (BVS) handles high-dimensional
regression by using a suitable prior to propose lower-dimensional models
which select a few explanatory variables out of the many ($K_n$) candidates.
For generalized linear models, we have shown that (see, e.g., Remark 1) a
near finite-dimensional rate of convergence can be obtained, even when the
number of candidate variables $K_n$ grows as any high power of the sample
size $n$. Such a good rate $\epsilon_n$ is derived assuming an exponentially decaying
tail $\Delta(r_n) = \inf_{\gamma: |\gamma| = r_n} \sum_{j \notin \gamma} |\beta_j^*|$. This includes as a special case the
situation when only a fixed and finite number of true regression coefficients
($\beta_j^*$'s) are nonzero. On the other hand, it also allows more realistic
situations with many small $|\beta_j^*|$'s, none of which is exactly zero. The rates we
obtain here are infinitesimally weaker than the finite-dimensional rate
$n^{-1/2}$. We suspect that the exact rate $n^{-1/2}$ cannot be achieved in the setup
that we consider, since the priors we use need to propose models of dimension
$r_n$ increasing to infinity as $n$ increases (even though $r_n \ll n$). This is for
the purpose of being able to approximate a true model to any precision. With
such increasing model dimensions, we suspect that the exact $n^{-1/2}$-rate
cannot be achieved in any way.
Although we have only considered in detail the situation with an
exponentially decaying $\Delta(\cdot)$, the more general framework of, for example,
Theorem 2 allows us to treat other situations of $\Delta(\cdot)$ as well. For example,
when $\Delta(\cdot)$ follows an inverse power law, the convergence rate can be somewhat
slower. However, even in such situations, BVS can still exhibit some
"resistance against overfitting" when $K_n$ is large. Not only can posterior
consistency be still achieved when $\lim_{n \to \infty} \sum_{j=1}^{K_n} |\beta_j^*| < \infty$, but also the
convergence rates will not be directly linked to the large dimension $K_n$;
they will be related to the sizes of the $|\beta_j^*|$'s instead.
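The "bias" rate $\Delta(r)$ is simple to compute for a given coefficient vector: the infimum over models of size $r$ is attained by keeping the $r$ largest $|\beta_j^*|$, so $\Delta(r)$ is the sum of the remaining magnitudes. The sketch below contrasts an exponentially decaying sequence with an inverse power law; both coefficient sequences and the function name are hypothetical illustrations.

```python
def tail_sum(beta: list[float], r: int) -> float:
    """Delta(r): sum of |beta_j| over all but the r largest magnitudes."""
    mags = sorted((abs(b) for b in beta), reverse=True)
    return sum(mags[r:])

exp_decay = [2.0 ** -j for j in range(1, 51)]     # beta_j = 2^{-j}
power_law = [j ** -2.0 for j in range(1, 1001)]   # beta_j = j^{-2}

for r in (5, 10, 20):
    print(r, tail_sum(exp_decay, r), tail_sum(power_law, r))
```

Under exponential decay the tail drops geometrically in $r$, whereas under the power law it shrinks only like $1/r$, so a much larger model is needed to reach the same approximation error.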

An Associate Editor raised the interesting question whether the sparseness
conditions for the true regression coefficients can be extended to a form
of $\ell_k$-summability for $k > 1$ (such as $\ell_2$). We do not have a general answer,
except in an analytically-friendly special case as follows: The true model is
$y \sim N(x^T\beta^*, 1)$ (it can be extended to allow a dispersion parameter), such
that $E xx^T$ forms an identity matrix, or more generally, $E xx^T$ and its inverse
both have bounded eigenvalues. The prior proposes fitted models of the form
$y \sim N(x_\gamma^T \beta_\gamma, 1)$, according to the prior specification in Section 3. For this
example, by a treatment parallel to the current paper, we can accommodate
$\beta^*$ that is $\ell_2$-summable but not $\ell_1$-summable, such as $\beta^* = (j^{-1})_1^{K_n}$,
resulting in a possibly slower rate for posterior convergence. On the other
hand, when $\beta^* = (j^{-1/2})_1^{K_n}$, which is $\ell_k$-summable for $k > 2$ but not
$\ell_2$-summable, the current approach does not work. Roughly speaking, we would
need to use very complicated fitted models of size $|\gamma| \sim K_n$ to approximate
the true density, in order to obtain a nonzero prior probability over a small
neighborhood of the true model. Then the complexity/entropy conditions [e.g.,
equation (10)], which imply $|\gamma| \ln K_n \prec n$, could not be satisfied for such
fitted models of size $|\gamma| \sim K_n$ in a high-dimensional setting $K_n > n$.
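The two coefficient sequences mentioned here can be examined through their partial sums. The sketch below is purely illustrative; the truncation point $K = 10^5$ and the function name are our own choices.

```python
import math

def partial_norm(exponent: float, k: float, K: int) -> float:
    """Sum of |beta_j|^k for beta_j = j^(-exponent), j = 1, ..., K."""
    return sum(j ** (-exponent * k) for j in range(1, K + 1))

K = 10 ** 5
l1_inv = partial_norm(1.0, 1, K)      # beta_j = 1/j, l1 partial sum (grows like ln K)
l2_inv = partial_norm(1.0, 2, K)      # beta_j = 1/j, squared l2 partial sum
l2_root = partial_norm(0.5, 2, K)     # beta_j = j^{-1/2}, squared l2 partial sum
l3_root = partial_norm(0.5, 3, K)     # beta_j = j^{-1/2}, k = 3 partial sum
print(l1_inv, l2_inv, math.pi ** 2 / 6)
print(l2_root, l3_root)
```

For $\beta_j^* = j^{-1}$ the $\ell_1$ partial sums keep growing with $K$ while the squared $\ell_2$ sums settle near $\pi^2/6$; for $\beta_j^* = j^{-1/2}$ the squared $\ell_2$ sums grow without bound while the $k = 3$ sums remain bounded.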

Although the topic of our paper is Bayesian, it is noted that the use of
$\ell_1$-type conditions here is related to some other work in the frequentist
approach. Our paper is closer to Bühlmann [2] in the sense that both assume a
true model satisfying some $\ell_1$ summability condition, while the fitted model
(boosting for Bühlmann [2] and BVS for the current paper) does not use
an $\ell_1$ constraint or penalization. The fitted models in this paper are
proposed according to a prior that uses i.i.d. binary distributions (with a
small selection probability) when selecting the candidate variables. This may
be regarded as a nondeterministic way of penalizing the $\ell_0$ norm of $\beta$ (or the
number of nonzero regression coefficients) of the fitted models. On the other
hand, in Greenshtein and Ritov [12], Greenshtein [11] and Meinshausen and
Bühlmann [20], the fitted models (instead of the true models) are subject to
an $\ell_1$ constraint or penalization. In the more general framework of
persistence in Greenshtein and Ritov [12] and Greenshtein [11], the true models
actually do not need to satisfy the $\ell_1$ summability condition.






The current paper focuses on fitting a density with Bayesian variable 
selection (BVS). A referee raised some interesting questions about the use 
of BVS when the main goal is selecting the variables. In some sense the 
current paper does prove that the method of BVS will provide "good" sets 
of variables, based on which good predictive performance, for example, in 
classification, can be achieved; see discussion in Section 5. The paper focused 
on the more realistic situation when there is no simple true model with 
many zero regression coefficients. All variables may have some effects, more 
or less. So the problem is not to select a "true" model (which would be 
the full model) but a "good" model (possibly much simpler than the full 
model) that achieves good performance for prediction, regression or density 
estimation. In this sense the paper does address variable selection and shows 
that BVS provides "good" sets of variables with high posterior probability. 
What will happen when there does exist a small true model, for example, 
when some regression coefficients are bounded away from zero, while the rest 
are exactly zero? We conjecture that, with high probability, BVS will select 
all the "relevant" variables with nonzero regression coefficients, but it may 
also include some "irrelevant" variables, with small regression coefficients 
proposed by the posterior. A truncation scheme similar to thresholding may 
be used to screen out the "irrelevant" variables, if necessary. However, we 
leave this as an open question, since such a scenario, being more idealized 
but still very interesting, is not within the main scope of the current paper. 

Another future work may be to consider the (generalized) linear structure 
of the fitted models in a misspecified framework such as in Kleijn and van 
der Vaart [16], so that the true model may be nonlinear. On the other hand, 
one should note that nonlinearity may be treated even under the linear 
framework of the true model. This can be done by including higher order 
terms, interactions, regression spline terms with various knot-locations, and 
so on (see, e.g., Smith and Kohn [23]).